Adding machine.service.internal dns entries #123
Conversation
That's an interesting use case! I'd love to learn a bit more about how you plan to elect and denote that leader. Is this some kind of external program that will call the service with a hardcoded machine name and essentially tell it to become a leader? What kind of service is it (any known 3rd-party software or your custom one)?

Concerns

Your proposal of adding machine-name.service-name entries means machine names have to be valid DNS labels. This is pretty straightforward and I don't see a reason not to introduce that kind of DNS-label or identifier-like validation for names regardless of this proposal. I didn't introduce it in the first place because I was just overwhelmed by the amount of design decisions I had to make when I started the project, so this one was simply ignored 😅

Another concern is that a machine name can be changed after the fact, so the DNS records would have to follow renames.

Tradeoffs

Machine names are also not guaranteed to be unique. The reason for this is the eventually consistent store without consensus (corrosion) we use. This is a tradeoff we made. For that reason, we also have 128-bit unique IDs for machines which, if I'm not mistaken, are not currently visible to a user in any command output. Your proposal reminded me about Fly.io's per-machine .internal DNS names.

What do you think of using a machine ID instead of a name as the first iteration? How inconvenient would that be, given that we can provide a way to inspect a machine to get its ID? So the FQDN would be something like machine-id.service-name.internal. Do you think we can just ignore the machine name conflicts for now and proceed with the machine ID?

Other ideas

This is not something I've been actively thinking about, but still some food for thought. I imagine that the uncloud cluster as an "infra platform" could potentially provide a way to help replicated services elect leaders. Not sure how this can work, e.g. an API endpoint which service replicas can query asking "am I a leader?", with the uncloud machines doing all the quorum/leader election heavy lifting. Somewhat similar to how leader election could be implemented in k8s: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go. The client-go leaderelection package includes some primitives you can use in your Go code to elect and check leadership; it's implemented using etcd leases if I recall correctly. Again, this is out of scope for the foreseeable future, but still good to think about what an ideal solution could look like.
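For reference, this is roughly what using the client-go leaderelection primitives looks like — purely a sketch: the lease name, namespace, and timings are placeholders, and it assumes a Kubernetes cluster, which is not something uncloud provides.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // each replica needs a unique identity

	// Replicas compete for a coordination.k8s.io Lease object acting as the lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "sync-service-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// This replica is now the leader; run leader-only work here.
			},
			OnStoppedLeading: func() {
				// Leadership lost; stop leader-only work.
			},
		},
	})
}
```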
Apologies for the delay. Just getting back to this. I think the idea of leader elections provided by uncloud is interesting (and I could see using something like that in the future), but my initial use case is much less interesting -- I need to be able to address a specific machine's instance of a service because I have a (manually denoted) "leader". Although "primary" might be a better term to use here than "leader". Basically I have a process running on a single machine that updates a large file and I need to sync it to the other machines. For smaller files, I would just copy them to object storage and then download the update on each machine, but this one is large enough that I would like to rsync the changes from the leader/primary machine that generates the update. I also expect to have a few of these cases where different machines might be the "primary", so I was thinking of a simple sync service with the same functionality on each machine, to which I'd be able to issue commands like "B/C sync X from A", "A/C sync Y from B", etc. There might be a better way to accomplish the above, but that's how I got to this idea of a machine-addressable DNS entry for the service.

I don't recall if it was fly.io's DNS names specifically, but the idea originated with seeing something similar in terms of internal names. I think using the machine ID (rather than the name) makes sense and would work for me. It definitely solves a lot of issues around renaming and DNS-compliant names. I see fly.io uses machine-id.vm.app-name.internal for this. I'll update this PR to use the machine ID and whichever FQDN format you prefer.
Also, a thought on this bit about "infra platform":
While the leader election part might be a later enhancement, I could see some potential value in something like AWS's EC2 instance metadata endpoint.

Or maybe something like that already exists? The services' environment has an obvious place for at least the immutable values.
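For illustration, the EC2 pattern from inside an instance looks roughly like this (the classic IMDSv1 call; newer setups require the IMDSv2 token handshake). The uncloud analogue in the comments below is entirely hypothetical — no such endpoint exists today:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Real EC2 example: a link-local HTTP endpoint serving instance metadata.
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/instance-id")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	id, _ := io.ReadAll(resp.Body)
	fmt.Println("instance-id:", string(id))

	// A hypothetical uncloud equivalent might expose something similar, e.g.:
	//   GET http://<metadata-endpoint>/machine/id
	//   GET http://<metadata-endpoint>/machine/name
	// These paths are invented purely to illustrate the idea.
}
```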
And another related bit on general infra: I am running a message queue/bus service for coordination, but my usage of it is very simple so far. If the corrosion layer uncloud is already using could be reused for some simple service-customizable state and/or a low-volume message bus, I could see that being very useful for many applications.
Thank you for clarifying your use case! 🙏 I think the approach you suggested with using DNS names to address a specific machine's instance of a service makes sense.

Yeah, that's something I was actually thinking about when describing an endpoint for providing leader election info. That's definitely an interesting idea I'll keep in mind. But as you said, using ENV for immutable values is likely the easiest and most versatile approach as it doesn't require the service to write any code for fetching the data.

It is indeed a great idea. Although I'm not yet sure it could be used as a general-purpose message bus. Corrosion is very SQLite-centric and it doesn't provide an easy way to distribute schema changes, e.g. to ensure a new table is created on all nodes. A queue could probably be emulated through a single DB table we statically create on each machine (see the sketch below). But we would still need to abstract all these low-level details and expose some generic interface to a user. I'm also not yet sure corrosion will stay long-term, as we barely use its CRDT capabilities and we don't really need the relational model. So for now, I'd prefer not to introduce new features that could make it harder to replace corrosion with something even more lightweight in the future.

Would you like me to add the new DNS entry for containers on the specific machine? Or do you want to try it on your own?
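To make the single-table idea concrete, a very rough sketch — the table and column names are invented, and in practice writes would have to go through whatever interface corrosion exposes rather than plain SQL, so treat this purely as an illustration of the shape:

```go
package queue

import "database/sql"

// Hypothetical table that every machine creates statically and the cluster
// store keeps replicated. All names here are made up.
const schema = `
CREATE TABLE IF NOT EXISTS service_messages (
    id         TEXT PRIMARY KEY,  -- e.g. a UUID generated by the publisher
    topic      TEXT NOT NULL,
    payload    TEXT NOT NULL,
    created_at INTEGER NOT NULL   -- unix timestamp used for polling order
);`

// Publish appends a message to the table.
func Publish(db *sql.DB, id, topic, payload string, now int64) error {
	_, err := db.Exec(
		`INSERT INTO service_messages (id, topic, payload, created_at) VALUES (?, ?, ?, ?)`,
		id, topic, payload, now)
	return err
}

// Poll returns messages for a topic newer than the given timestamp; consumers
// track their own offset, so the table behaves like a simple append-only queue.
func Poll(db *sql.DB, topic string, since int64) (*sql.Rows, error) {
	return db.Query(
		`SELECT id, payload, created_at FROM service_messages
		 WHERE topic = ? AND created_at > ? ORDER BY created_at`,
		topic, since)
}
```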
Hello,
Is the service deployed to multiple machines, but you want to specifically connect to that particular one? Or did you explicitly deploy it to a single machine (e.g. via x-machines)?

Asking because if the service name is unique (there are no multiple entries), then you can just resolve the service name as it is today. Now if the service is replicated but you designated one as the leader, then you will need a way to share which machine names are available so clients know which one to target.

Fly.io's DNS server went a bit further than just plain A/AAAA records and has TXT ones that contain the list of machines running a particular service (e.g. vms.<app>.internal). I also think that the machine-id reference is a good compromise, but having a way to obtain the list of all machines for a specific service can also help. Either way, it will require some coding on your end to determine which one is the leader and then connect to it. 😊

Cheers,
DNS is a simple but still powerful discovery mechanism that is easily accessible on the client without requiring any custom protocols to be implemented. It's not really hard to do what Fly does. Just let me know if something like TXT records with information about service containers would be useful for you and we can easily add them. I've just played with Fly's DNS for reference:

```
# dig +short TXT _apps.internal
"ps-nginx,test-nginx-ps,uncloud"
# dig +short TXT _instances.internal
"instance=d8913edc2d66e8,app=uncloud,ip=fdaa:9:8ace:a7b:1b4:2da6:5200:2,region=syd;instance=d8d3370ae66948,app=uncloud,ip=fdaa:9:8ace:a7b:4df:9cc6:43e0:2,region=syd;instance=e8254ddb603638,app=uncloud,ip=fdaa:9:8ace:a7b:2d8:2e88:fb30:2,region=syd"
# dig +short TXT vms.uncloud.internal
"d8913edc2d66e8 syd,d8d3370ae66948 syd,e8254ddb603638 syd"
# dig +short TXT regions.uncloud.internal
"syd"
# dig +short AAAA d8d3370ae66948.vm.uncloud.internal
fdaa:9:8ace:a7b:4df:9cc6:43e0:2
```
I'm updating (simplifying, really) this PR now. I'm working on getting tests running locally, but the change will (I think) be a pretty simple addition now that it's just using the machine ID. Something like this in the resolver:
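(A reconstructed sketch of the kind of addition meant here, based on the diff quoted further down; the surrounding map-building code is simplified, not the actual resolver implementation.)

```go
// Inside the loop that builds the resolver's lookup map from container records:
newServiceIPs[ctr.ServiceName()] = append(newServiceIPs[ctr.ServiceName()], ip)
newServiceIPs[ctr.ServiceID()] = append(newServiceIPs[ctr.ServiceID()], ip)

// Add <machine-id>.m.<service-name> as an extra lookup key so a client can
// address the service's containers on one specific machine.
serviceNameWithMachineID := record.MachineID + ".m." + ctr.ServiceName()
newServiceIPs[serviceNameWithMachineID] = append(newServiceIPs[serviceNameWithMachineID], ip)
```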
I'm planning to just run one replica per machine, but for what I'm doing it would also be fine with multiple -- the machine-specific data is in a host volume, so any instance on the machine would be able to access it. At the moment, I can't think of a use case for addressing a specific instance among multiple replicas on a machine that I couldn't just cover with the service ID DNS entry instead.
I imagine those will be useful for other applications (by me or others) in the future, but I should be fine with the machine-addressable service option for now. Using the term "leader" was probably a mistake on my part. The instances of this "sync service" will be identical across replicas/machines, but they will receive commands like "sync data X from machine A". Machine A will see the command is for itself and no-op. Machine B will run something like an rsync pull addressed to machine A (sketched below).
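As a hypothetical example of the command machine B might run, using the new machine-scoped DNS entry (the service name "sync", the paths, and the rsync daemon setup are all made up):

```go
package main

import (
	"fmt"
	"os/exec"
)

// syncFrom pulls a dataset from the machine that currently holds the fresh copy.
// The <machine-id>.m.<service-name>.internal name is the entry added by this PR;
// everything else here is illustrative.
func syncFrom(sourceMachineID, dataset string) error {
	host := fmt.Sprintf("%s.m.sync.internal", sourceMachineID)
	cmd := exec.Command("rsync", "-az", "--delete",
		fmt.Sprintf("rsync://%s/data/%s/", host, dataset),
		fmt.Sprintf("/data/%s/", dataset),
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("rsync failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// "sync data X from machine A": machine A no-ops, every other machine pulls.
	if err := syncFrom("machine-a-id", "X"); err != nil {
		panic(err)
	}
}
```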
```go
newServiceIPs[ctr.ServiceID()] = append(newServiceIPs[ctr.ServiceID()], ip)

// Add <machine-id>.m.<service-name> as a lookup
serviceNameWithMachineID := record.MachineID + ".m." + ctr.ServiceName()
```
After thinking a bit more about this, I think we can add machine names here as well if you like. If a machine name is an invalid domain label, a DNS request simply won't make it to this resolver, so it would be up to the user to ensure their machine names are valid DNS labels if they want to use this DNS lookup. You can list all machines with store.ListMachines, put them in a map, and then look the name up here instead of querying the DB for each container with store.GetMachine (see the sketch below).

For the machine renaming, ideally we need to subscribe to changes in the machines table to trigger this update on any change. But in practice I don't think machines will be renamed often, and any container change will correct the outdated records anyway. So leaving a TODO will be enough for now.
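A sketch of what that could look like — store.ListMachines and store.GetMachine are the methods mentioned above, but the signatures, field names, and surrounding code are assumptions:

```go
// Build a machine-id -> machine-name map once per refresh instead of calling
// store.GetMachine for every container record.
machines, err := store.ListMachines(ctx)
if err != nil {
	return err
}
machineNames := make(map[string]string, len(machines))
for _, m := range machines {
	machineNames[m.ID] = m.Name
}

// Then, per container record, register the name-based key alongside the ID-based one.
if name, ok := machineNames[record.MachineID]; ok {
	// Only resolvable if the name happens to be a valid DNS label; otherwise the
	// query never reaches this resolver anyway.
	key := name + ".m." + ctr.ServiceName()
	newServiceIPs[key] = append(newServiceIPs[key], ip)
}
```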
Note also that the number of services in the "DNS records updated" debug log below is calculated incorrectly after this change.
I'm having some trouble running tests. Are some failing on main currently?
```
=== RUN TestUpdateMachine/update_with_empty_request
machine_test.go:442:
Error Trace: /Users/justin/Projects/external/uncloud/test/e2e/machine_test.go:442
Error: Received unexpected error:
rpc error: code = Unimplemented desc = unknown method UpdateMachine for service api.Cluster
Test: TestUpdateMachine/update_with_empty_request
=== NAME TestServiceLifecycle/3_replicas_with_volume_auto-created
assert.go:102:
Error Trace: /Users/justin/Projects/external/uncloud/test/e2e/assert.go:102
/Users/justin/Projects/external/uncloud/test/e2e/assert.go:32
/Users/justin/Projects/external/uncloud/test/e2e/service_test.go:1572
Error: Not equal:
expected: container.RestartPolicy{Name:"unless-stopped", MaximumRetryCount:0}
actual : container.RestartPolicy{Name:"always", MaximumRetryCount:0}
Diff:
--- Expected
+++ Actual
@@ -1,3 +1,3 @@
(container.RestartPolicy) {
- Name: (container.RestartPolicyMode) (len=14) "unless-stopped",
+ Name: (container.RestartPolicyMode) (len=6) "always",
MaximumRetryCount: (int) 0
Test: TestServiceLifecycle/3_replicas_with_volume_auto-created
```
Also, I fixed the count in the debug log line.
I've just pushed the latest image ghcr.io/psviderski/ucind built from main that is used in e2e tests for running machines in Docker containers. It needs to be rebuilt before running e2e tests that depend on new code in the backend (the uncloudd daemon). I usually run make ucind-image locally.
Ah sorry, this is not relevant to your issues above, but I broke the ucind (Uncloud in Docker) cluster and tests on main after introducing the unregistry component. Let me fix this.
I fixed the unregistry component and rebuilt the ucind image.
Tests should be fixed now: https://github.com/psviderski/uncloud/actions/runs/17936623170
I pushed a couple more commits.
Not sure if there's a better way to do that test, but it was the best Claude and I could come up with so far. 😄
Adds TestInternalDNS to verify DNS functionality including:
- Service name resolution to all container IPs across machines
- Machine-specific DNS lookups using <machine-id>.m.<service-name>.internal format
- Service ID DNS resolution for backward compatibility

Uses wbitt/network-multitool for DNS queries and host volume mounts to capture results from within the cluster network.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Nice work!
Looks good! I was also thinking of making a toggle for the machine-specific DNS entries.
It's fine 👍. I generally try not to create new test suites that create more ucind clusters, as they consume a non-negligible amount of RAM and slow down tests. But it's alright, I can refactor this later if needed. Please address the linter complaints and I'm happy to merge this: https://github.com/psviderski/uncloud/actions/runs/17938964884/job/51013199831?pr=123
I think I fixed it. (Also, I noticed that running it locally seems to complain about a few pre-existing issues unrelated to this change.)
Yeah, that makes sense, both the toggle and adding the ID to the machine inspect output.
I have a use case where I want to connect to a service container running on a particular machine, and this adds an additional entry to the internal DNS service resolver with machine-name.service-name (in addition to the existing service-name and service-id entries).

This is just an initial, quick stab at the idea. I think it might be cleaner to retrieve the machine name as part of the cluster store's containers subscribe/list functions, with a join in its select, instead:
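For example, something along these lines — note that the table, column, and field names below are guesses, not the actual store schema:

```go
// Hypothetical shape of the change: the containers list/subscribe query joins
// the machines table so the machine name travels with each record.
const listContainersQuery = `
SELECT c.id, c.service_id, c.service_name, c.machine_id, m.name AS machine_name
FROM containers c
JOIN machines m ON m.id = c.machine_id`

// ContainerRecord would then carry the extra field through to the resolver.
type ContainerRecord struct {
	// existing fields elided...
	MachineID   string
	MachineName string // populated from the join above
}
```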
And then pass that through the ContainerRecord they return and internal/machine/dns/resolver.go uses.

Anyway, I thought I'd get your input on this first before I go further with that.
My use case is ultimately about needing to denote one machine as a "leader" for the service, and this approach seemed like something that could be generally useful (for myself and others). I could also accomplish this with two minor variants of the service, but that's a little more manual (needing to manage the x-machines for the followers).