Skip to content

Feat: support no traffic disruption during kmesh upgrade#1503

Merged
kmesh-bot merged 24 commits intokmesh-net:mainfrom
072020127:main
Oct 30, 2025
Merged

Feat: support no traffic disruption during kmesh upgrade#1503
kmesh-bot merged 24 commits intokmesh-net:mainfrom
072020127:main

Conversation

@072020127
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Implemented logic for storing and reading persistent MapSpecs
  • Added retrieval of the current MapSpec
  • Implemented comparison logic for Map structure
  • Added unit tests and e2e test programs

Which issue(s) this PR fixes:

Fixes #1409

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @072020127, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature enabling seamless, traffic-undisrupted upgrades for the kmesh-daemon. It achieves this by persisting BPF map specifications, comparing them during daemon restarts, and intelligently migrating map data to new versions. This ensures that ongoing network traffic is not interrupted when the daemon is updated.

Highlights

  • Persistent BPF MapSpecs: Implemented logic for storing and reading persistent BPF MapSpecs, which are crucial for maintaining state across kmesh-daemon upgrades.
  • MapSpec Comparison Logic: Added retrieval of the current MapSpec and implemented robust comparison logic for BPF Map structures to detect changes between versions.
  • Seamless Daemon Upgrades: Introduced an 'Update' mode for kmesh-daemon, allowing BPF programs and maps to be updated without traffic disruption by migrating existing map data.
  • Comprehensive Testing: Included new unit tests for MapSpec loading and comparison, as well as an end-to-end test to validate the traffic-undisrupted upgrade process.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


BPF maps persist, Upgrades flow, no traffic lost, Code ensures smooth ride.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature for non-disruptive upgrades of the kmesh-daemon. The core logic involves persisting BPF map specifications (MapSpecs), comparing them across versions, and migrating maps when their structure changes. The implementation adds new files for handling BPF updates and loading MapSpecs, along with extensive unit and e2e tests.

My review focuses on the new update logic. I've found a few critical issues related to error handling that could lead to panics, and a bug where changes in struct bitfield sizes are not detected, potentially causing data corruption. I've also identified areas with unhandled errors and code duplication that should be addressed to improve robustness and maintainability. The suggestions provided aim to fix these issues and refactor the code for better clarity.

Comment on lines +275 to +279
if map_old.Offset != uint32(map_new.Offset) {
log.Info(map_old.BitfieldSize, uint32(map_new.BitfieldSize))
diff.OffsetChanged = true
break
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The BitfieldSize of a struct member is not being compared. This is a critical bug, as a change in bitfield size alters the struct layout and should trigger a map migration to avoid data corruption. The existing log.Info call should be replaced with a proper comparison.

  if map_old.Offset != uint32(map_new.Offset) || map_old.BitfieldSize != uint32(map_new.BitfieldSize) {
   diff.OffsetChanged = true
   break
  }

pkg/bpf/bpf.go Outdated
Comment on lines +235 to +236
mapspec, _ := restart.LoadCompileTimeSpecs(config)
restart.SnapshotSpecsByPkg(mapspec)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned by restart.LoadCompileTimeSpecs is ignored. This could lead to restart.SnapshotSpecsByPkg being called with a nil mapspec, which might cause a panic. The error should be checked and handled to prevent this.

Suggested change
mapspec, _ := restart.LoadCompileTimeSpecs(config)
restart.SnapshotSpecsByPkg(mapspec)
mapspec, err := restart.LoadCompileTimeSpecs(config)
if err != nil {
log.Errorf("failed to load compile time specs: %v", err)
return nil
}
restart.SnapshotSpecsByPkg(mapspec)

Comment on lines +66 to +70
specsByPkg, err := LoadCompileTimeSpecs(config)
if specsByPkg == nil {
log.Errorf("load oldSpecsByPkg failed")
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned by LoadCompileTimeSpecs is not handled. If an error occurs, specsByPkg could be nil and the function might proceed, potentially causing a panic later. It's crucial to check and handle this error. Also, the log message on line 68 refers to oldSpecsByPkg which seems to be a typo.

Suggested change
specsByPkg, err := LoadCompileTimeSpecs(config)
if specsByPkg == nil {
log.Errorf("load oldSpecsByPkg failed")
return nil
}
specsByPkg, err := LoadCompileTimeSpecs(config)
if err != nil {
log.Errorf("load compile-time map specs failed: %v", err)
return nil
}

Comment on lines +138 to +233
func migrateMap(
oldMapSpec *PersistedMapSpec,
newMapSpec *ebpf.MapSpec,
pkgName, mapName, pinPath string,
) (*ebpf.Map, error) {
if oldMapSpec == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
if oldMapSpec.Type != newMapSpec.Type.String() ||
oldMapSpec.KeySize != newMapSpec.KeySize ||
oldMapSpec.ValueSize != newMapSpec.ValueSize ||
oldMapSpec.MaxEntries != newMapSpec.MaxEntries {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}

oldKeyStruct := oldMapSpec.KeyInfo
newKeyStruct, okNewKey := newMapSpec.Key.(*btf.Struct)
if okNewKey {
diff := diffStructInfoAgainstBTF(oldKeyStruct, newKeyStruct, make(map[string]bool))
if diff.Added || diff.Removed || diff.TypeChanged ||
diff.OffsetChanged || diff.NestedChanged {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
if newMapSpec.Key == nil {
if len(oldKeyStruct.Members) != 0 || oldKeyStruct.Name != "" {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
newKeyTypeName := newMapSpec.Key.TypeName()
if len(oldKeyStruct.Members) != 0 || newKeyTypeName != oldKeyStruct.Name {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
}
}

oldValueStruct := oldMapSpec.ValueInfo
newValueStruct, okNewValue := newMapSpec.Value.(*btf.Struct)
if okNewValue {
diff := diffStructInfoAgainstBTF(oldValueStruct, newValueStruct, make(map[string]bool))
if diff.Added || diff.Removed || diff.TypeChanged ||
diff.OffsetChanged || diff.NestedChanged {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
log.Info(diff.Added ,diff.Removed,diff.TypeChanged ,
diff.OffsetChanged,diff.NestedChanged)
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
if newMapSpec.Value == nil {
if len(oldValueStruct.Members) != 0 || oldValueStruct.Name != "" {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
newValTypeName := newMapSpec.Value.TypeName()
if len(oldValueStruct.Members) != 0 || newValTypeName != oldValueStruct.Name {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
}
}
log.Info("pass all check")
return nil, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The migrateMap function has high complexity and significant code duplication. The logic to decide whether to recreate a map is repeated in multiple branches. This makes the code hard to read and maintain. Consider refactoring this function to extract the map recreation logic into a helper function and simplify the conditional checks.

@codecov
Copy link
Copy Markdown

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 0.53619% with 371 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.81%. Comparing base (716c0d3) to head (3c861d1).
⚠️ Report is 208 commits behind head on main.

Files with missing lines Patch % Lines
pkg/bpf/restart/bpf_update.go 0.00% 263 Missing ⚠️
pkg/bpf/restart/new_version_mapspec_loader.go 0.00% 95 Missing ⚠️
pkg/bpf/bpf.go 20.00% 6 Missing and 2 partials ⚠️
pkg/bpf/ads/sock_connection.go 0.00% 1 Missing ⚠️
pkg/bpf/ads/sock_ops.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/skb.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/sock_connection.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/sock_ops.go 0.00% 1 Missing ⚠️
Files with missing lines Coverage Δ
pkg/bpf/ads/sock_connection.go 0.00% <0.00%> (ø)
pkg/bpf/ads/sock_ops.go 0.00% <0.00%> (ø)
pkg/bpf/workload/skb.go 0.00% <0.00%> (ø)
pkg/bpf/workload/sock_connection.go 0.00% <0.00%> (ø)
pkg/bpf/workload/sock_ops.go 0.00% <0.00%> (ø)
pkg/bpf/bpf.go 45.48% <20.00%> (+5.05%) ⬆️
pkg/bpf/restart/new_version_mapspec_loader.go 0.00% <0.00%> (ø)
pkg/bpf/restart/bpf_update.go 0.00% <0.00%> (ø)

... and 93 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7da837...3c861d1. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned this is a very big change, could introduce regression. Can you please add a feature flag, we can turn on until we have enough tests and documents


// LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages.
// Returns a nested map: packageName to mapName to *ebpf.MapSpec
func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by CompileTime

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to describe that these specs are fixed when the binary is compiled (via bpf2go), as opposed to the specs loaded from the persisted on-disk snapshot. Maybe LoadEmbeddedSpecs will be clearer to reflect its role?

} else {
specs["KmeshCgroupSockWorkload"] = coll.Maps
}
if coll, err := dualengine.LoadKmeshCgroupSockWorkloadCompat(); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only call either LoadKmeshCgroupSockWorkload or LoadKmeshCgroupSockWorkloadCompat

please check

	if helper.KernelVersionLowerThan5_13() {
		spec, err = bpf2go.LoadKmeshSockopsCompat()
	} else {
		spec, err = bpf2go.LoadKmeshSockops()
	}


// LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages.
// Returns a nested map: packageName to mapName to *ebpf.MapSpec
func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering how we can not miss one map, even after we add a new map later.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a go file for gen-bpf_stpecs.go in the/back directory, which can automatically generate LoadCompileTimeSpecs by analyzing the bpf2Go.go file. If there are new maps or progs in the future, it can be easily kept synchronized.

func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
specs := make(map[string]map[string]*ebpf.MapSpec)

if config.KernelNativeEnabled() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the tag //go:build enhanced, it is for kernel native mode, so we donot need the below else branch

pkg/bpf/bpf.go Outdated
return nil
}

mapspec, err := restart.LoadCompileTimeSpecs(config)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mapspec, err := restart.LoadCompileTimeSpecs(config)
mapSpecs, err := restart.LoadCompileTimeSpecs(config)

// It will migrate any BPF maps whose on‑disk pin already exists but whose
// compiled MapSpec has changed
func UpdateMapHandler(versionMap *ebpf.Map, kmBpfPath string, config *options.BpfConfig) *ebpf.Map {
persistedSpecs, err := LoadPersistedSnapshot()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persistedSpecs can be nil, which cause panic later, please check all

log.Warnf("failed to remove old map pinpath: %v (continuing)", err)
}
case hasNew && !hasOld:
if _, err := migrateMap(nil, newSpec, pkgName, mapName, pinPath); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer directly use call the map create function

return versionMap
}

func updateVersionInfo(versionMap *ebpf.Map) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can share with that one storeVersionInfo in pkg/bpf/bpf.go

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to import package restart in pkg/bpf/bpf.go. If i call storeVersionInfo in pkg/bpf/bpf.go, this will result in recursive import.

}

// patchKmesh applies a strategic merge patch to the Kmesh DaemonSet and waits for rollout completion.
func patchKmesh_upgarde(t framework.TestContext, patchData string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo upgrade

func upgradeKmesh(t framework.TestContext) {
newImage := os.Getenv("KMESH_UPGRADE_IMAGE")
if newImage == "" {
newImage = "localhost:5000/kmesh"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnot this same image with old daemonset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attached a new image which is build in run_test.sh set_daemonupgarde_testcase_image() to KMESH_UPGRADE_IMAGE. But E2e test istio 1.23~1.25 report no space left in device. I think the reason is there are two images existing ((localhost:5000/kmesh) and localhost:5000/kmesh/test-upgrade-map-change)

@LiZhenCheng9527
Copy link
Copy Markdown
Contributor

We update configuration of e2e to fix this errors. Please rebase and then resubmit the code.

Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
@hzxuzhonghu
Copy link
Copy Markdown
Member

lgtm

please fix the ci

Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
@072020127
Copy link
Copy Markdown
Contributor Author

The CI is frequently getting stuck on the eBPF unit test section. However, the code in this PR does not modify any eBPF-related modules, and the eBPF unit tests are able to pass locally. I have re-tested 4 times. In those 4 tests, the eBPF unit tests ran for 1 hour twice, and reported a "no space left" error the other two times. Therefore, I think it may be a problem with the CI environment
a0c1e8787abbd9999d7788f8a560350a

@LiZhenCheng9527
Copy link
Copy Markdown
Contributor

LiZhenCheng9527 commented Oct 30, 2025

This issue was caused by using the Ubuntu 22.04 image version in CI. However, it has been resolved by upgrading the Ubuntu image version.

#1513

So maybe you update the Ubuntu image in CI can solve this issue

Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
@LiZhenCheng9527
Copy link
Copy Markdown
Contributor

/lgtm
/approve

@kmesh-bot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LiZhenCheng9527

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kmesh-bot kmesh-bot merged commit f4d1eea into kmesh-net:main Oct 30, 2025
14 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OSPP 2025] Kmesh-daemon upgrades traffic without disruption

4 participants