Skip to content

endpoint: Init DNS history trigger only when datapath is ready for it#34059

Merged
joamaki merged 1 commit intocilium:mainfrom
joamaki:pr/joamaki/fix-early-endpoint-header-sync
Jul 29, 2024
Merged

endpoint: Init DNS history trigger only when datapath is ready for it#34059
joamaki merged 1 commit intocilium:mainfrom
joamaki:pr/joamaki/fix-early-endpoint-header-sync

Conversation

@joamaki
Copy link
Copy Markdown
Contributor

@joamaki joamaki commented Jul 29, 2024

This fixes a rare crash that can occur when a restored endpoint is doing DNS requests while the first loader Reinitialize() is still not completed (e.g. waiting for node information).

Crash:

  time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
  time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
  ...
  time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu ..
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]

  goroutine 368 [running]:
  github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
  github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
  github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS history trigger when the LocalNodeConfiguration is not yet set. Fix the issue being initializing the trigger from regenerateBPF which is called only after datapath reinitialize has completed and it is ready to process the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both before the compilation lock and before nodeConfig.Store. This reliably reproduced the issue and the fix was effective. Adding these sleeps did not uncover other problems.

A principled long-term fix for this and similar issues lands in #33023 which gates all requests towards the loader and makes sure all relevant data is present.

Fixes: #34019

Fix a nil dereference crash during cilium-agent initialization affecting setups with FQDN policies. The crash is triggered when a restored endpoint performs a DNS request just a the right time during early cilium-agent restoration. Problem is not expected to be persistent and the agent should get pass the problematic part of the initialization on restart.

This fixes a rare crash that can occur when a restored endpoint is doing
DNS requests while the first loader Reinitialize() is still not completed
(e.g. waiting for node information).

Crash:
  time="2024-07-26T09:54:49Z" level=debug msg="Updated FQDN with new IPs" IPs="[75.2.60.5]" matchName=isovalent.com. subsys=fqdn
  time="2024-07-26T09:54:49Z" level=debug msg="Waited for endpoints to regenerate due to a DNS response" duration="64.816µs" endpointID=1050 qname=isovalent.com. subsys=daemon
  ...
  time="2024-07-26T09:54:49Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=default/ubuntu ..
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x2c65ce7]

  goroutine 368 [running]:
  github.com/cilium/cilium/pkg/datapath/types.(*LocalNodeConfiguration).DeviceNames(...)
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/types/node.go:165
  github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteEndpointConfig(0xc00269ab40, {0x445aaa0, 0xc00067d060?}, 0x0, {0x44df670, 0xc001b28808})
          /home/jussi/go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:1045 +0x127
  github.com/cilium/cilium/pkg/datapath/loader.(*loader).WriteEndpointConfig(0xc001b28808?, {0x445aaa0?, 0xc00067d060?}, {0x44df670?, 0xc001b28808?})

The issue is due to WriteEndpointConfig being called via the endpoint DNS
history trigger when the LocalNodeConfiguration is not yet set. Fix the
issue being initializing the trigger from regenerateBPF which is called
only after datapath reinitialize has completed and it is ready to process
the endpoint config writing.

The fix was tested by adding a 5 second sleep into Reinitialize(), both
before the compilation lock and before nodeConfig.Store. This reliably
reproduced the issue and the fix was effective. Adding these sleeps
did not uncover other problems.

A principled long-term fix for this and similar issues lands in cilium#33023
which gates all requests towards the loader and makes sure all relevant
data is present.

Fixes: cilium#34019

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki requested a review from squeed July 29, 2024 09:49
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 29, 2024
@joamaki joamaki added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.15 and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.15 labels Jul 29, 2024
@joamaki
Copy link
Copy Markdown
Contributor Author

joamaki commented Jul 29, 2024

/test

@joamaki joamaki marked this pull request as ready for review July 29, 2024 11:08
@joamaki joamaki requested a review from a team as a code owner July 29, 2024 11:08
@joamaki joamaki requested a review from tommyp1ckles July 29, 2024 11:08
@joamaki joamaki closed this Jul 29, 2024
@joamaki joamaki reopened this Jul 29, 2024
@joamaki joamaki enabled auto-merge July 29, 2024 11:09
@joamaki joamaki added this pull request to the merge queue Jul 29, 2024
Merged via the queue into cilium:main with commit 258819d Jul 29, 2024
@joamaki joamaki deleted the pr/joamaki/fix-early-endpoint-header-sync branch July 29, 2024 11:23
@nbusseneau nbusseneau mentioned this pull request Aug 3, 2024
23 tasks
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 labels Aug 6, 2024
squeed added a commit to squeed/cilium that referenced this pull request Nov 11, 2024
This trigger may be, well, triggered when reacting to a DNS request --
even when the agent is starting up. This could lead to a deadlock, as
the datapath is not able to write the endpoint header file until the
agent is started, but the agent cannot finish starting as the endpoint
is locked.

The fix for this is to remove the unnecessary trigger initialization on
endpoint parsing; we will always start it on first regeneration.

This catches a case missed in cilium#34059, which only fixed the new-endpoint
case.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
aanm pushed a commit that referenced this pull request Nov 12, 2024
This trigger may be, well, triggered when reacting to a DNS request --
even when the agent is starting up. This could lead to a deadlock, as
the datapath is not able to write the endpoint header file until the
agent is started, but the agent cannot finish starting as the endpoint
is locked.

The fix for this is to remove the unnecessary trigger initialization on
endpoint parsing; we will always start it on first regeneration.

This catches a case missed in #34059, which only fixed the new-endpoint
case.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
bimmlerd pushed a commit to bimmlerd/cilium that referenced this pull request Nov 12, 2024
[ upstream commit 0bb676d ]

This trigger may be, well, triggered when reacting to a DNS request --
even when the agent is starting up. This could lead to a deadlock, as
the datapath is not able to write the endpoint header file until the
agent is started, but the agent cannot finish starting as the endpoint
is locked.

The fix for this is to remove the unnecessary trigger initialization on
endpoint parsing; we will always start it on first regeneration.

This catches a case missed in cilium#34059, which only fixed the new-endpoint
case.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Nov 12, 2024
[ upstream commit 0bb676d ]

This trigger may be, well, triggered when reacting to a DNS request --
even when the agent is starting up. This could lead to a deadlock, as
the datapath is not able to write the endpoint header file until the
agent is started, but the agent cannot finish starting as the endpoint
is locked.

The fix for this is to remove the unnecessary trigger initialization on
endpoint parsing; we will always start it on first regeneration.

This catches a case missed in #34059, which only fixed the new-endpoint
case.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SIGSEGV: invalid memory address or nil pointer dereference in DeviceNames

3 participants