| 1 | +--- |
| 2 | +title: Kmesh-daemon upgrades traffic without disruption |
| 3 | +authors: |
| 4 | +- "@072020127" |
| 5 | +reviews: |
| 6 | +- |
| 7 | +approves: |
| 8 | +- |
| 9 | + |
| 10 | +create-date: 2025-07-08 |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Kmesh-daemon upgrades traffic without disruption |
| 15 | + |
| 16 | +### Summary |
| 17 | + |
| 18 | +Add traffic-preserving upgrades to Kmesh-daemon. |
| 19 | + |
| 20 | +### Motivation |
| 21 | + |
| 22 | +Currently, Kmesh supports traffic-preserving restarts but does not support traffic-preserving upgrades. During upgrades, existing eBPF map state may be discarded if the map definitions change, leading to connection drops, policy resets, or performance metric loss. |
| 23 | + |
| 24 | +This proposal improves the upgrade experience by: |
| 25 | + |
| 26 | +- Preserving important state (flows, policies, metrics) across versions |
| 27 | +- Allowing safe, autonomous rolling upgrades in Kubernetes environments |
| 28 | +- Reducing operational risk and improving reliability in production deployments |
| 29 | + |
| 30 | +#### Goals |
| 31 | + |
| 32 | +The purpose of this proposal is to enable seamless traffic continuity during version upgrades by detecting map changes and migrating data safely. |
| 33 | + |
| 34 | +### Design Details |
| 35 | + |
| 36 | +#### Map Compatibility Detection |
| 37 | + |
| 38 | +1. **Runtime MapSpec Loader**: The comparison begins by loading the `MapSpec` of each map compiled into the running binary, which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries`, `Key` and `Value`.
| 39 | +This is done by calling `loadCompileTimeSpecs`, which loads each embedded `CollectionSpec` generated by bpf2go. The function iterates over the BPF engines enabled in the config (e.g., KernelNative, DualEngine, General) and returns a nested registry keyed first by a logical package name (e.g., `KmeshCgroupSock`) and then by map name.
| 40 | + |
| 41 | +```go |
| 42 | +func loadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
| 43 | +    specs := make(map[string]map[string]*ebpf.MapSpec)
| 44 | +
| 45 | +    if config.KernelNativeEnabled() {
| 46 | +        // KernelNative: cgroup_sock
| 47 | +        if coll, err := kernelnative.LoadKmeshCgroupSock(); err != nil {
| 48 | +            return nil, fmt.Errorf("load KernelNative KmeshCgroupSock spec: %w", err)
| 49 | +        } else {
| 50 | +            specs["KmeshCgroupSock"] = coll.Maps
| 51 | +        }
| 52 | +        // ... other KernelNative specs
| 53 | +    } else if config.DualEngineEnabled() {
| 54 | +        // DualEngine: cgroup_sock workload
| 55 | +        if coll, err := dualengine.LoadKmeshCgroupSockWorkload(); err != nil {
| 56 | +            return nil, fmt.Errorf("load DualEngine KmeshCgroupSockWorkload spec: %w", err)
| 57 | +        } else {
| 58 | +            specs["KmeshCgroupSockWorkload"] = coll.Maps
| 59 | +        }
| 60 | +        // ... other DualEngine specs
| 61 | +    }
| 62 | +
| 63 | +    // General: tc_mark_encrypt
| 64 | +    if coll, err := general.LoadKmeshTcMarkEncrypt(); err != nil {
| 65 | +        return nil, fmt.Errorf("load General KmeshTcMarkEncrypt spec: %w", err)
| 66 | +    } else {
| 67 | +        specs["KmeshTcMarkEncrypt"] = coll.Maps
| 68 | +    }
| 69 | +    // ... other General specs
| 70 | +
| 71 | +    return specs, nil
| 72 | +}
| 73 | +``` |
| 74 | + |
| 75 | +2. **MapSpec Snapshot**: On a normal or Update-type startup, Kmesh-daemon stores each `MapSpec` generated from the compiled BPF object in a user-space registry. Because raw `btf.Type` objects can’t be marshaled directly, a custom representation is used:
| 76 | + |
| 77 | +- `MemberInfo`: records each struct field’s name, type name, offset, and bitfield size. If the field is itself a struct, it carries a nested `StructInfo`.
| 78 | +
| 79 | +- `StructInfo`: represents a whole struct, storing its name and a slice of `MemberInfo` entries. If the key or value is not a struct (e.g., an int), `Name` holds the type name and `Members` is null.
| 80 | +
| 81 | +- `PersistedMapSpec`: stores the metadata for each map (name, type, sizes, max entries, flags) along with the `StructInfo` for its key and value.
| 82 | + |
| 83 | +The structure written to disk is `PersistedSnapshot`, which is keyed first by a logical package name (e.g., `KmeshCgroupSock`) and then by map name.
| 84 | + |
| 85 | +```go |
| 86 | +type MemberInfo struct {
| 87 | +    Name         string      `json:"name"`
| 88 | +    TypeName     string      `json:"typeName"`
| 89 | +    Offset       uint32      `json:"offset"`
| 90 | +    BitfieldSize uint32      `json:"bitfieldsize"` // only set when the member is a bitfield
| 91 | +    Nested       *StructInfo `json:"nested,omitempty"`
| 92 | +}
| 93 | +
| 94 | +type StructInfo struct {
| 95 | +    Name    string       `json:"name"`
| 96 | +    Members []MemberInfo `json:"members"`
| 97 | +}
| 98 | +
| 99 | +type PersistedMapSpec struct {
| 100 | +    Name       string     `json:"name"`
| 101 | +    Type       string     `json:"type"` // MapType.String()
| 102 | +    KeySize    uint32     `json:"keySize"`
| 103 | +    ValueSize  uint32     `json:"valueSize"`
| 104 | +    MaxEntries uint32     `json:"maxEntries"`
| 105 | +    Flags      uint32     `json:"flags"`
| 106 | +    KeyInfo    StructInfo `json:"keyInfo"` // derived from btf.Struct
| 107 | +    ValueInfo  StructInfo `json:"valueInfo"`
| 108 | +}
| 109 | +
| 110 | +type PersistedSnapshot struct {
| 111 | +    Maps map[string]map[string]PersistedMapSpec `json:"maps"`
| 112 | +}
| 113 | +``` |
| 114 | + |
| 115 | +3. **Persisted MapSpec Loader**: The daemon reads the previously written snapshot file and unmarshals the JSON into the `PersistedSnapshot` structure. This provides the baseline set of old map specs used for compatibility checking against the newly compiled specs.
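
The save and load steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the proposal's implementation: the helper names `saveSnapshot`, `loadSnapshot` and `roundTrip`, the file name `mapspec_snapshot.json`, and the map name `km_example` are all hypothetical, and the types are trimmed copies of the ones defined earlier so the example compiles on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Trimmed copies of the proposal's types so this sketch is self-contained.
type MemberInfo struct {
	Name     string      `json:"name"`
	TypeName string      `json:"typeName"`
	Offset   uint32      `json:"offset"`
	Nested   *StructInfo `json:"nested,omitempty"`
}

type StructInfo struct {
	Name    string       `json:"name"`
	Members []MemberInfo `json:"members"`
}

type PersistedMapSpec struct {
	Name      string     `json:"name"`
	KeySize   uint32     `json:"keySize"`
	ValueSize uint32     `json:"valueSize"`
	KeyInfo   StructInfo `json:"keyInfo"`
	ValueInfo StructInfo `json:"valueInfo"`
}

type PersistedSnapshot struct {
	Maps map[string]map[string]PersistedMapSpec `json:"maps"`
}

// saveSnapshot (hypothetical helper) writes the snapshot to a temporary file
// and then renames it into place, so a reader does not see a partial file.
func saveSnapshot(path string, snap PersistedSnapshot) error {
	data, err := json.MarshalIndent(snap, "", "  ")
	if err != nil {
		return fmt.Errorf("marshal snapshot: %w", err)
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return fmt.Errorf("write %s: %w", tmp, err)
	}
	return os.Rename(tmp, path)
}

// loadSnapshot (hypothetical helper) reads the file back into a PersistedSnapshot.
func loadSnapshot(path string) (PersistedSnapshot, error) {
	var snap PersistedSnapshot
	data, err := os.ReadFile(path)
	if err != nil {
		return snap, fmt.Errorf("read %s: %w", path, err)
	}
	if err := json.Unmarshal(data, &snap); err != nil {
		return snap, fmt.Errorf("unmarshal %s: %w", path, err)
	}
	return snap, nil
}

// roundTrip saves a one-map snapshot into a temporary directory and loads it back.
func roundTrip() (PersistedSnapshot, error) {
	dir, err := os.MkdirTemp("", "kmesh-snapshot")
	if err != nil {
		return PersistedSnapshot{}, err
	}
	defer os.RemoveAll(dir)

	snap := PersistedSnapshot{Maps: map[string]map[string]PersistedMapSpec{
		"KmeshCgroupSock": {
			"km_example": { // hypothetical map name
				Name:      "km_example",
				KeySize:   4,
				ValueSize: 8,
				KeyInfo:   StructInfo{Name: "__u32"}, // non-struct key: no members
				ValueInfo: StructInfo{Name: "example_value", Members: []MemberInfo{
					{Name: "id", TypeName: "__u32", Offset: 0},
					{Name: "count", TypeName: "__u32", Offset: 32},
				}},
			},
		},
	}}

	path := filepath.Join(dir, "mapspec_snapshot.json") // hypothetical file name
	if err := saveSnapshot(path, snap); err != nil {
		return PersistedSnapshot{}, err
	}
	return loadSnapshot(path)
}

func main() {
	snap, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", snap.Maps["KmeshCgroupSock"]["km_example"])
}
```

Keeping the snapshot as plain JSON makes the baseline easy to inspect by hand when an upgrade is being debugged.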
| 116 | + |
| 117 | +4. **Layout Diffing**: A recursive function, `diffStructInfoAgainstBTF`, compares the old and new struct definitions field by field. It detects field additions, removals, type changes, offset shifts, and nested structure changes, and uses a visited map to avoid infinite recursion on self-referential types. The resulting fine-grained structural diff guides compatibility decisions.
| 118 | + |
| 119 | +```go |
| 120 | +type StructDiff struct {
| 121 | +    Removed       bool // fields present in the old layout but missing in the new one
| 122 | +    Added         bool // fields present in the new layout but missing in the old one
| 123 | +    TypeChanged   bool // same-name fields whose type changed
| 124 | +    OffsetChanged bool // same-name fields whose offset changed
| 125 | +    NestedChanged bool // same-name fields of struct type whose nested layout changed
| 126 | +}
| 127 | +``` |
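
To make the diffing idea concrete, here is a simplified sketch that compares two persisted `StructInfo` values with each other rather than a `StructInfo` against live BTF, which is what `diffStructInfoAgainstBTF` is described as doing. The function name `diffStructInfo`, the `Changed` method, and the field layouts in `main` are assumptions made for illustration only.

```go
package main

import "fmt"

// Trimmed copies of the proposal's types (JSON tags omitted for brevity).
type MemberInfo struct {
	Name     string
	TypeName string
	Offset   uint32
	Nested   *StructInfo
}

type StructInfo struct {
	Name    string
	Members []MemberInfo
}

type StructDiff struct {
	Removed       bool
	Added         bool
	TypeChanged   bool
	OffsetChanged bool
	NestedChanged bool
}

// Changed reports whether any kind of difference was recorded (hypothetical helper).
func (d StructDiff) Changed() bool {
	return d.Removed || d.Added || d.TypeChanged || d.OffsetChanged || d.NestedChanged
}

// diffStructInfo compares two layouts field by field. visited records the
// names of structs already compared so self-referential types terminate.
func diffStructInfo(oldInfo, newInfo StructInfo, visited map[string]bool) StructDiff {
	var d StructDiff
	if oldInfo.Name != "" {
		if visited[oldInfo.Name] {
			return d
		}
		visited[oldInfo.Name] = true
	}

	newByName := make(map[string]MemberInfo, len(newInfo.Members))
	for _, m := range newInfo.Members {
		newByName[m.Name] = m
	}

	oldNames := make(map[string]bool, len(oldInfo.Members))
	for _, om := range oldInfo.Members {
		oldNames[om.Name] = true
		nm, ok := newByName[om.Name]
		if !ok {
			d.Removed = true
			continue
		}
		if om.TypeName != nm.TypeName {
			d.TypeChanged = true
		}
		if om.Offset != nm.Offset {
			d.OffsetChanged = true
		}
		// Recurse only when both sides carry a nested struct layout.
		if om.Nested != nil && nm.Nested != nil {
			if diffStructInfo(*om.Nested, *nm.Nested, visited).Changed() {
				d.NestedChanged = true
			}
		}
	}
	for _, nm := range newInfo.Members {
		if !oldNames[nm.Name] {
			d.Added = true
		}
	}
	return d
}

func main() {
	oldVal := StructInfo{Name: "example_value", Members: []MemberInfo{
		{Name: "id", TypeName: "__u32", Offset: 0},
		{Name: "count", TypeName: "__u32", Offset: 32},
	}}
	// The new layout appends one field and leaves the existing ones untouched.
	newVal := StructInfo{Name: "example_value", Members: []MemberInfo{
		{Name: "id", TypeName: "__u32", Offset: 0},
		{Name: "count", TypeName: "__u32", Offset: 32},
		{Name: "flags", TypeName: "__u32", Offset: 64},
	}}
	fmt.Printf("%+v\n", diffStructInfo(oldVal, newVal, map[string]bool{}))
}
```

Reporting each kind of change separately, instead of a single changed/unchanged flag, leaves room for a caller to treat, say, an appended field differently from a type change.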
| 128 | + |
| 129 | +#### Map Migration Logic |
| 130 | + |
| 131 | +1. **New Map Creation**: When a layout change is detected, a new map is created from the latest `MapSpec` and temporarily pinned at the old map’s pin path with "_tmp" appended. If no change is detected, the existing map is left intact and no further action is taken.
| 132 | + |
| 133 | +2. **Atomic Pin Swap**: Once data migration completes, the daemon unpins the old map, closes its file descriptor, removes the old pin file if it still exists, and finally renames the new map’s temporary pin path to the original pin path.
| 134 | + |
| 135 | +```go |
| 136 | +if err := oldMap.Unpin(); err != nil && !os.IsNotExist(err) {
| 137 | +    log.Warnf("failed to unpin old map %s: %v (continuing)", pinPath, err)
| 138 | +}
| 139 | +if err := oldMap.Close(); err != nil {
| 140 | +    log.Warnf("failed to close old map FD: %v (continuing)", err)
| 141 | +}
| 142 | +if err := os.Remove(pinPath); err != nil && !os.IsNotExist(err) {
| 143 | +    return nil, fmt.Errorf("remove old pin %s failed: %w", pinPath, err)
| 144 | +}
| 145 | +if err := os.Rename(tmpPinPath, pinPath); err != nil {
| 146 | +    return nil, fmt.Errorf("rename tmp pin %s to old pin %s failed: %w", tmpPinPath, pinPath, err)
| 147 | +}
| 148 | +``` |
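
The path handling in the swap can be tried out without a BPF filesystem. The sketch below uses ordinary files in a temporary directory as stand-ins for pinned maps, so it only illustrates the ordering of the filesystem calls; whether bpffs behaves identically is not verified here. The helpers `tmpPinPathFor`, `swapPin` and `demoSwap` are hypothetical, and the check that the new pin exists before the old one is removed is an extra precaution not described in the proposal.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tmpPinPathFor derives the temporary pin path by appending "_tmp".
func tmpPinPathFor(pinPath string) string {
	return pinPath + "_tmp"
}

// swapPin removes the old pin and renames the temporary pin into its place.
// It first checks that the temporary pin exists, so a missing new map never
// causes the old pin to be deleted.
func swapPin(pinPath, tmpPinPath string) error {
	if _, err := os.Stat(tmpPinPath); err != nil {
		return fmt.Errorf("new pin %s not ready: %w", tmpPinPath, err)
	}
	if err := os.Remove(pinPath); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("remove old pin %s failed: %w", pinPath, err)
	}
	if err := os.Rename(tmpPinPath, pinPath); err != nil {
		return fmt.Errorf("rename tmp pin %s to old pin %s failed: %w", tmpPinPath, pinPath, err)
	}
	return nil
}

// demoSwap runs the swap on two ordinary files and returns what ends up at
// the original path.
func demoSwap() (string, error) {
	dir, err := os.MkdirTemp("", "kmesh-pin-swap")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	pinPath := filepath.Join(dir, "km_example") // hypothetical map name
	tmpPinPath := tmpPinPathFor(pinPath)
	if err := os.WriteFile(pinPath, []byte("old"), 0o600); err != nil {
		return "", err
	}
	if err := os.WriteFile(tmpPinPath, []byte("new"), 0o600); err != nil {
		return "", err
	}

	if err := swapPin(pinPath, tmpPinPath); err != nil {
		return "", err
	}
	data, err := os.ReadFile(pinPath)
	return string(data), err
}

func main() {
	content, err := demoSwap()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("content at original pin path:", content)
}
```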
| 149 | + |
| 150 | +#### Hot Program Replacement |
| 151 | + |
| 152 | +**Atomic Swap**: Once all maps are migrated, the new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically replace the loaded program with the new one. `BpfProgUpdate(progPinPath, cgopt)` performs two steps:
| 153 | + |
| 154 | +1. LoadPinnedLink: Reopens the existing `bpf_link` from its pinned path, recovering the same link object that is already attached in the kernel.
| 155 | + |
| 156 | +2. link.Update(newProgFD): Atomically swaps the BPF program FD on that link to `cgopt.Program`, preserving the existing hook and any accumulated state. |
| 157 | + |
| 158 | +This approach ensures there is no packet loss during the transition. Take `BpfSockOps` as an example: if the start type is Restart or Update, the existing pinned link is recovered and updated with the new program:
| 159 | + |
| 160 | +```go |
| 161 | +func (sc *BpfSockOps) Attach() error {
| 162 | +    var err error
| 163 | +    cgopt := link.CgroupOptions{
| 164 | +        Path:    sc.Info.Cgroup2Path,
| 165 | +        Attach:  sc.Info.AttachType,
| 166 | +        Program: sc.KmeshSockopsObjects.SockopsProg,
| 167 | +    }
| 168 | +    // pin bpf_link
| 169 | +    progPinPath := filepath.Join(sc.Info.BpfFsPath, constants.Prog_link)
| 170 | +    if restart.GetStartType() == restart.Restart || restart.GetStartType() == restart.Update {
| 171 | +        if sc.Link, err = utils.BpfProgUpdate(progPinPath, cgopt); err != nil {
| 172 | +            return err
| 173 | +        }
| 174 | +    } else {
| 175 | +        sc.Link, err = link.AttachCgroup(cgopt)
| 176 | +        if err != nil {
| 177 | +            return err
| 178 | +        }
| 179 | +        if err = sc.Link.Pin(progPinPath); err != nil {
| 180 | +            return err
| 181 | +        }
| 182 | +    }
| 183 | +    return nil
| 184 | +}
| 185 | +``` |
| 186 | + |
| 187 | +#### Workflow |
| 188 | + |
| 189 | + |
| 190 | + |
| 191 | +#### Testing Plan |
| 192 | + |
| 193 | +1. **Unit Tests**: Validate the key functions, including `loadCompileTimeSpecs`, `diffStructInfoAgainstBTF`, `SnapshotSpecsByPkg` and `LoadPersistedSnapshot`.
| 194 | +
| 195 | +2. **E2E Tests**: Run Kmesh upgrades with live traffic and verify data continuity, no packet loss, and zero connection resets.