Skip to content

Updating the header file with DNS rules during policy updates#33412

Closed
vipul-21 wants to merge 1 commit intocilium:mainfrom
vipul-21:singhvipul/dnsrules
Closed

Updating the header file with DNS rules during policy updates#33412
vipul-21 wants to merge 1 commit intocilium:mainfrom
vipul-21:singhvipul/dnsrules

Conversation

@vipul-21
Copy link
Copy Markdown
Contributor

Writing the DNS rules to the endpoint header/json file as we apply the policy changes, This keeps the ep_config.json and ep_config.h in sync with the current rules being applied the policy.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 26, 2024
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Jun 26, 2024
@vipul-21 vipul-21 force-pushed the singhvipul/dnsrules branch from 08e875a to 7b9de21 Compare June 26, 2024 19:16
@vipul-21
Copy link
Copy Markdown
Contributor Author

/test

Signed-off-by: Vipul Singh <singhvipul@microsoft.com>
@vipul-21 vipul-21 force-pushed the singhvipul/dnsrules branch from 7b9de21 to adc0402 Compare June 26, 2024 20:54
@vipul-21 vipul-21 marked this pull request as ready for review June 26, 2024 22:39
@vipul-21 vipul-21 requested review from a team as code owners June 26, 2024 22:39
@vipul-21 vipul-21 requested review from pippolo84 and sayboras June 26, 2024 22:39
Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have a comment below on the approach.

Comment on lines +51 to +52
log.Infof("Updating the DNS rules for endpoint %d", dr.redirect.endpointID)
dr.redirect.localEndpoint.SyncEndpointHeaderFile()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SyncEndpointHeaderFile is already called within the endpoint pkg (pkg/endpoint) and during policy recalculation. Why is this necessary here especially expanding the scope of the proxy to reach into the management of endpoints?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey sorry for the delay in response. So the intention here is to update the DNS rules to the ep_config.json when we apply the policy.
Afaik, it does not write the dns rules to the file when the policy is applied. So I was calling this function which is triggered when we want to write the dns rules to the file.
Current implementation, I see that the data is written to the fs after the policy applied, but dns rules are not. Is there another way we can incorporate getting the DNS rules to be written to the fs ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation, I see that the data is written to the fs after the policy applied, but dns rules are not. Is there another way we can incorporate getting the DNS rules to be written to the fs ?

I'm not sure I understand the problem you are describing. The DNS rules are in fact written to the fs as they are fetched from the same function:

rules := e.owner.GetDNSRules(e.ID)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also another note to keep in mind here, see the comment regarding locks and deadlocks on the line above of the linked code: this PR might further complicate the locking pattern by now having the proxy reach into the headerfile.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; this is probably not necessary here. Rather, we should be syncing the endpoint header file explicitly on regeneration, which IIRC is not done now.

@squeed
Copy link
Copy Markdown
Contributor

squeed commented Jul 12, 2024

Ah, right, I remember what the problem is here: the cached DNS rules include the IP of the dns server, a.k.a the peer pod, which can change at any point in time, independent of the lifecycle of the subject endpoint.

So, DNS rules can be out-of-date even when no endpoint-specific edges have been missed. This is a bummer.

The solution, IMO, is to replace the Trigger with a Controller that runs periodically. That way we can be reasonably sure that the set of rules in question is up-to-date.

@vipul-21
Copy link
Copy Markdown
Contributor Author

Ah, right, I remember what the problem is here: the cached DNS rules include the IP of the dns server, a.k.a the peer pod, which can change at any point in time, independent of the lifecycle of the subject endpoint.

So, DNS rules can be out-of-date even when no endpoint-specific edges have been missed. This is a bummer.

The solution, IMO, is to replace the Trigger with a Controller that runs periodically. That way we can be reasonably sure that the set of rules in question is up-to-date.

So do you also recommend calling the controller when there are new rules being added Or solely rely on the periodic update ? I guess there can be a scenario where rules are applied but state in fs is not updated as controller is yet to run. I think that's what you meant by reasonably sure part.
How frequent can/should the controller run ? Maybe have that configurable too.

@christarazi
Copy link
Copy Markdown
Member

Ah, right, I remember what the problem is here: the cached DNS rules include the IP of the dns server, a.k.a the peer pod, which can change at any point in time, independent of the lifecycle of the subject endpoint.

So, DNS rules can be out-of-date even when no endpoint-specific edges have been missed. This is a bummer.

The solution, IMO, is to replace the Trigger with a Controller that runs periodically. That way we can be reasonably sure that the set of rules in question is up-to-date.

Writing the DNS rules to disk is ultimately best effort because it's done synchronously and if the pod is deleted in the meantime before the latest rules can be synced to disk, then

Ah, right, I remember what the problem is here: the cached DNS rules include the IP of the dns server, a.k.a the peer pod, which can change at any point in time, independent of the lifecycle of the subject endpoint.
So, DNS rules can be out-of-date even when no endpoint-specific edges have been missed. This is a bummer.
The solution, IMO, is to replace the Trigger with a Controller that runs periodically. That way we can be reasonably sure that the set of rules in question is up-to-date.

So do you also recommend calling the controller when there are new rules being added Or solely rely on the periodic update ? I guess there can be a scenario where rules are applied but state in fs is not updated as controller is yet to run. I think that's what you meant by reasonably sure part. How frequent can/should the controller run ? Maybe have that configurable too.

I'm still not sure I understand the problem being solved here.

Basically, SyncEndpointHeaderFile() is called each time the endpoint makes a DNS request. SyncEndpointHeaderFile() triggers code with a 5s interval meaning the sync to the fs only occurs once every 5s.

Therefore, I don't yet see the arguement for why we should sync DNS rules to the fs on policy calculation (endpoint regeneration) when the code already syncs the DNS rules when there's a DNS request made, once every 5s.

@christarazi christarazi marked this pull request as draft July 19, 2024 21:29
@squeed
Copy link
Copy Markdown
Contributor

squeed commented Jul 24, 2024

@christarazi @vipul-21

Therefore, I don't yet see the arguement for why we should sync DNS rules to the fs on policy calculation (endpoint regeneration) when the code already syncs the DNS rules when there's a DNS request made, once every 5s.

There are two things cached in the endpoint directory for fast recovery on startup:

  1. The list of domain names + IPs seen by this endpoint
  2. The list of ports + dns server pod IPs that are allowed

No. 1 cannot change without a DNS request, so it makes sense that this is updated per-request. However, no. 2 has a completely unrelated lifecycle. In fact, the lifecycle may be completely independent of the node itself, as CoreDNS being rescheduled on a remote node requires this to be updated.

So, that's why we should periodically refresh the cached DNS information (a.k.a "write the header file"). Either we build a mechanism to somehow propagate dns rules selector updates (hard!), or we just periodically update endpoint cached DNS info.

Make sense?

@christarazi
Copy link
Copy Markdown
Member

@christarazi @vipul-21

Therefore, I don't yet see the arguement for why we should sync DNS rules to the fs on policy calculation (endpoint regeneration) when the code already syncs the DNS rules when there's a DNS request made, once every 5s.

There are two things cached in the endpoint directory for fast recovery on startup:

1. The list of domain names + IPs seen by this endpoint

2. The list of ports + dns server pod IPs that are allowed

No. 1 cannot change without a DNS request, so it makes sense that this is updated per-request. However, no. 2 has a completely unrelated lifecycle. In fact, the lifecycle may be completely independent of the node itself, as CoreDNS being rescheduled on a remote node requires this to be updated.

So, that's why we should periodically refresh the cached DNS information (a.k.a "write the header file"). Either we build a mechanism to somehow propagate dns rules selector updates (hard!), or we just periodically update endpoint cached DNS info.

Make sense?

That clarifies things a bit, thanks. What is the consequence if we don't update (2) on a periodic basis? I understand the rules are "out of date / out of sync" on the fs, but what is the actual consequence of them being out of date? Does the list of ports + DNS server pods IPs (coredns) never get updated in the endpoint caches?

@vipul-21
Copy link
Copy Markdown
Contributor Author

vipul-21 commented Aug 16, 2024

@christarazi

What is the consequence if we don't update (2) on a periodic basis?

As of now there shouldn't be any consequence, but this was discussed in sig-policy meeting in June: https://docs.google.com/document/d/1p6LuzoKR55_HgQJTWFInwj2FlIeo7_uAFXYuXe_Lw7Q/edit related to HA DNS Proxy CFP.

Need to determine L7 policy per-endpoint. Can we use epconfig.json to load per-endpoint policy, or do we need another discoverability mechanism?
This is not fast enough, but a PR to write epconfig.json on regeneration would be reasonable

Will try to do the controller approach as suggested.

@squeed
Copy link
Copy Markdown
Contributor

squeed commented Aug 19, 2024

@christarazi

What is the consequence if we don't update (2) on a periodic basis? [...] Does the list of ports + DNS server pods IPs (coredns) never get updated in the endpoint caches?

That's exactly right. This is caching the set of allowed IPs so we can correctly serve proxied DNS requests while the agent is starting up.

Since DNS rules are per port + destination, an endpoint could potentially have different DNS rules per destination. We also would like to be able to serve DNS requests very quickly when restarting the agent. Since a fresh agent will not know IP:Identity mappings, we cache IPs rather than identities. But then we never refresh the cache.

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Sep 19, 2024
@squeed
Copy link
Copy Markdown
Contributor

squeed commented Sep 19, 2024

@vipul-21 have you had a chance to look at this? It would be a nice fix to have!

@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Sep 20, 2024
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Oct 20, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Nov 4, 2024

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Nov 4, 2024
@vipul-21
Copy link
Copy Markdown
Contributor Author

@vipul-21 have you had a chance to look at this? It would be a nice fix to have!

@squeed Sorry for the delay.(busy with some other stuff).
Yes, let me update this.

@vipul-21
Copy link
Copy Markdown
Contributor Author

vipul-21 commented Nov 27, 2024

@squeed I updated the branch with the controller, but not seeing an option to reopen this PR ? Should I create a new one ?
vipul-21@a76b698

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants