feat: Add uc ps command to list all containers in the cluster (#165)
Conversation
Force-pushed from 54c3b4f to 4805bb4.
psviderski left a comment:
Overall looks good! Please see a few comments on the use of the custom bubble tea model, optimising the queries, and table layout.
cmd/uncloud/ps.go (Outdated)

        err error
    }

    func newSpinnerModel(client *client.Client, message string) spinnerModel {
Do we need a custom spinner model, or can we just use the spinner with one action? https://github.com/charmbracelet/huh?tab=readme-ov-file#bonus-spinner
The reason we have a custom connectModel is that we change the spinner text to the connection we're currently trying while iterating over all of them.
For this command, it looks like we're only printing "Collecting container info..." until we get all the containers.
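For reference, a rough sketch of what the one-shot huh spinner from that README link could look like here; containerInfo and collectContainers below are placeholders, not this PR's actual code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/charmbracelet/huh/spinner"
)

// containerInfo and collectContainers are hypothetical stand-ins for the real
// types and cluster query in cmd/uncloud/ps.go.
type containerInfo struct{ ID, Name string }

func collectContainers(ctx context.Context) ([]containerInfo, error) {
	return nil, nil // placeholder for the real cluster-wide container listing
}

func listWithSpinner(ctx context.Context) ([]containerInfo, error) {
	var (
		containers []containerInfo
		listErr    error
	)
	// One fixed title and one action: no custom bubbletea model is needed when
	// the spinner text never changes.
	if err := spinner.New().
		Title("Collecting container info...").
		Action(func() { containers, listErr = collectContainers(ctx) }).
		Run(); err != nil {
		return nil, fmt.Errorf("run spinner: %w", err)
	}
	return containers, listErr
}

func main() {
	containers, err := listWithSpinner(context.Background())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("found %d containers\n", len(containers))
}
```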
Updated.
cmd/uncloud/ps.go (Outdated)

    func newSpinnerModel(client *client.Client, message string) spinnerModel {
        s := spinner.New()
        s.Spinner = spinner.Jump
        s.Style = lipgloss.NewStyle().Foreground(lipgloss.Color("205"))
For the connection spinner, I also tried to use the same spinner type and colour used by the compose progress library (uc deploy output) to have a consistent UI language. It would be nice to maintain that consistency here as well.
Fixed.
cmd/uncloud/ps.go (Outdated)

    var containers []containerInfo
    for _, s := range services {
        service, err := m.client.InspectService(context.Background(), s.ID)
The client calls that are broadcast to all machines are quite expensive, which is why it takes a while to list all containers at the moment.
You can try using only one call to client.Docker.ListServiceContainers:
uncloud/internal/machine/docker/client.go, lines 479 to 481 in a30ed60:

    // ListServiceContainers returns all containers on requested machines that belong to the service with the given
    // name or ID. If serviceNameOrID is empty, all service containers are returned.
    func (c *Client) ListServiceContainers(

See the ListServices implementation for details (line 277 in 458d282):

    func (cli *Client) ListServices(ctx context.Context) ([]api.Service, error) {
Done.
    machinesNamesByID := make(map[string]string)
    for _, m := range machines {
        machinesNamesByID[m.Machine.Id] = m.Machine.Name
    }
The machines are always the same, so they could be fetched and machinesNamesByID computed only once rather than for every service, to optimise the response time.
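Roughly the shape of that suggestion, as a sketch pieced together from the snippets in this thread (the surrounding loop and any names not shown above are assumptions):

```go
// Fetch the machines once, before iterating over services/containers
// (sketch only; cli and the loop variables are assumptions).
machines, err := cli.ListMachines(ctx, nil)
if err != nil {
	return fmt.Errorf("list machines: %w", err)
}
machinesNamesByID := make(map[string]string, len(machines))
for _, m := range machines {
	machinesNamesByID[m.Machine.Id] = m.Machine.Name
}
// Inside the per-service/per-container loop, only the map lookup remains,
// e.g. machineName := machinesNamesByID[containerMachineID].
```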
Done.
cmd/uncloud/ps.go (Outdated)

    if len(id) > 12 {
        id = id[:12]
    }
    name := strings.TrimPrefix(ctr.name, "/")
I don't think this is needed; we trim it when deserialising into our api.Container.
Not a big deal, but stringid.TruncateID(c.ID) from the docker package could also be used so we don't hardcode the logic for shortening container IDs.
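For example, a standalone sketch of that helper (the ID value is just a placeholder):

```go
package main

import (
	"fmt"

	"github.com/docker/docker/pkg/stringid"
)

func main() {
	// Placeholder full container ID; TruncateID returns the standard 12-character
	// short form, the same format `docker ps` prints.
	fullID := "4f1c2e7b9a0d5c6e8f1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d5e6f7"
	fmt.Println(stringid.TruncateID(fullID))
}
```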
Fixed.
cmd/uncloud/ps.go (Outdated)

        header = "MACHINE\tSERVICE\tCONTAINER ID\tNAME\tIMAGE\tSTATUS"
    } else {
        header = "SERVICE\tMACHINE\tCONTAINER ID\tNAME\tIMAGE\tSTATUS"
    }
I quite like the layout of the pod listing in k9s (similar to kubectl):
[screenshot: k9s pod listing with the NODE column last]
It prints the node/machine column last. We do something similar for uc inspect SERVICE.
I think we need to decide on the layout and stick to it for all relevant commands to decrease the cognitive load on users. I might be wrong, but I also feel that changing the column order when grouping by another column might not be desirable, as it again requires additional effort to read the headers and understand where the data has been relocated.
k9s, for example, supports sorting by any column and doesn't change the layout. It's like tables on a website: you can click on a header to change the sorting order, but the columns don't swap. So I believe the --group-by option isn't much different from this. Potentially the flag could even be called --sort. I'm not sure which one is more intuitive though.
To me, it seems natural to present data ordered from the largest dimension to the smallest:
service > container > container properties
If we have a namespace/stack concept, then:
namespace > service > container > container properties
The machine a container runs on is somewhat of an odd one. It's not a one-to-many relationship like service-container; it's many-to-many. So it's hard to find a good place for it. Putting it as the first column somewhat breaks the logical hierarchy. Putting it as the last one still allows finding it quickly while remaining less intrusive. That's my current reasoning.
I've moved MACHINE to the last column.
One odd thing I noticed: it seems the lipgloss color styling of some cells in the second-to-last column can cause tabwriter to lay things out incorrectly.
Have you encountered that issue?
Ah yeah, I think I did, and that was the reason I used the lipgloss table instead of tabwriter for formatting the uc image ls output (uncloud/cmd/uncloud/image/ls.go, lines 261 to 304 in 939645e):
    t := table.New().
        // Remove the default border.
        Border(lipgloss.Border{}).
        BorderTop(false).
        BorderBottom(false).
        BorderLeft(false).
        BorderRight(false).
        BorderHeader(false).
        BorderColumn(false).
        StyleFunc(func(row, col int) lipgloss.Style {
            if row == table.HeaderRow {
                return lipgloss.NewStyle().Bold(true).PaddingRight(3)
            }
            // Regular style for data rows with padding.
            return lipgloss.NewStyle().PaddingRight(3)
        })
    var headers []string
    for _, col := range columns {
        if !col.hide {
            headers = append(headers, col.name)
        }
    }
    t.Headers(headers...)
    for _, row := range rows {
        values := []string{
            row.id,
            row.name,
            row.platforms,
            row.createdHuman,
            row.size,
            row.inUse,
            row.store,
            row.machine,
        }
        var filteredValues []string
        for i, v := range values {
            if !columns[i].hide {
                filteredValues = append(filteredValues, v)
            }
        }
        t.Row(filteredValues...)
    }
I think we need to use lipgloss table or something similar from the bubble tea world to correctly print styled data. Perhaps create our own convenient table abstraction wrapping one of those and use it for all CLI commands to make the formatting consistent.
> I think we need to use lipgloss table or something similar from the bubble tea world to correctly print styled data.

Yeah, that fixed it. I just re-used the setup from image ls for this PR, but...

> Perhaps create our own convenient table abstraction wrapping one of those and use it for all CLI commands to make the formatting consistent.

I'll do this as a follow-up PR with lipgloss.table. I was also thinking the abstraction might give us a good opportunity to add JSON output support too, since that should be a simple second "formatter" for the table data these commands all generate.
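A hypothetical sketch of what such an abstraction could look like, reusing the lipgloss table setup quoted above; all names here are made up and not from the follow-up PR:

```go
package cli

import (
	"encoding/json"

	"github.com/charmbracelet/lipgloss"
	"github.com/charmbracelet/lipgloss/table"
)

// Output holds headers and rows once; each formatter decides the presentation.
type Output struct {
	Headers []string
	Rows    [][]string
}

// Table renders with the same borderless lipgloss style used by image ls above.
func (o Output) Table() string {
	t := table.New().
		Border(lipgloss.Border{}).
		BorderTop(false).BorderBottom(false).
		BorderLeft(false).BorderRight(false).
		BorderHeader(false).BorderColumn(false).
		StyleFunc(func(row, col int) lipgloss.Style {
			if row == table.HeaderRow {
				return lipgloss.NewStyle().Bold(true).PaddingRight(3)
			}
			return lipgloss.NewStyle().PaddingRight(3)
		}).
		Headers(o.Headers...)
	for _, r := range o.Rows {
		t.Row(r...)
	}
	return t.Render()
}

// JSON renders the same data as a list of header->value objects.
func (o Output) JSON() (string, error) {
	objs := make([]map[string]string, 0, len(o.Rows))
	for _, r := range o.Rows {
		obj := make(map[string]string, len(o.Headers))
		for i, h := range o.Headers {
			if i < len(r) {
				obj[h] = r[i]
			}
		}
		objs = append(objs, obj)
	}
	b, err := json.MarshalIndent(objs, "", "  ")
	return string(b), err
}
```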
btw, a draft of the "table abstraction" idea (though it ended up more of a JSON output option with some table helpers): #191
Force-pushed from 5018b5a to 890cb99.
cmd/uncloud/ps.go (Outdated)

            md.Append("machines", machineIP)
        }
    }
    listCtx := metadata.NewOutgoingContext(ctx, md)
JFYI, you can also use this helper to create a context for broadcasting (lines 62 to 98 in abea3e2):
    // ProxyMachinesContext returns a new context that proxies gRPC requests to the specified machines.
    // If namesOrIDs is nil, all machines are included.
    func ProxyMachinesContext(
        ctx context.Context, cli MachineClient, namesOrIDs []string,
    ) (context.Context, MachineMembersList, error) {
        // TODO: move the machine IP resolution to the proxy router to allow setting machine names and IDs in the metadata.
        machines, err := cli.ListMachines(ctx, nil)
        if err != nil {
            return nil, nil, fmt.Errorf("list machines: %w", err)
        }
        var proxiedMachines MachineMembersList
        var notFound []string
        for _, nameOrID := range namesOrIDs {
            if m := machines.FindByNameOrID(nameOrID); m != nil {
                proxiedMachines = append(proxiedMachines, m)
            } else {
                notFound = append(notFound, nameOrID)
            }
        }
        if len(notFound) > 0 {
            return nil, nil, fmt.Errorf("machines not found: %s", strings.Join(notFound, ", "))
        }
        if len(namesOrIDs) == 0 {
            proxiedMachines = machines
        }
        md := metadata.New(nil)
        for _, m := range proxiedMachines {
            machineIP, _ := m.Machine.Network.ManagementIp.ToAddr()
            md.Append("machines", machineIP.String())
        }
        return metadata.NewOutgoingContext(ctx, md), proxiedMachines, nil
    }
Thanks for pointing that out! I've updated this to use ProxyMachinesContext.
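Roughly the shape of the call site after that change (a hedged sketch, not the actual diff; the api package path and the surrounding variables are assumptions):

```go
// nil namesOrIDs means the request is proxied to every machine in the cluster.
proxyCtx, proxiedMachines, err := api.ProxyMachinesContext(ctx, cli, nil)
if err != nil {
	return err
}
// proxyCtx carries the "machines" metadata, so a single gRPC call made with it
// is broadcast to all of them; proxiedMachines can still be used to map machine
// IDs/IPs to names for the MACHINE column.
_ = proxiedMachines
```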
cmd/uncloud/ps.go (Outdated)

            continue
        }
        if msc.Metadata.Error != "" {
            return nil, fmt.Errorf("list containers on machine %s: %s", msc.Metadata.Machine, msc.Metadata.Error)
I think we shouldn't fail here, as that would make the uc ps command useless if a machine is down or temporarily unavailable. What we can do instead is print a warning for each unavailable machine but still list the containers on the available machines.
I've changed this to a PrintWarning (like the volume command uses) and continue.
Force-pushed from dc014dd to 4e6c3b4.
psviderski left a comment:
Looks great! Just one comment to address an edge case when the cluster consists of only one machine 👍
cmd/uncloud/ps.go (Outdated)

    var containers []containerInfo
    for _, msc := range machineContainers {
        if msc.Metadata == nil {
            continue
I agree this handling of broadcasted requests is very confusing, and I have plans to rework it to simplify it significantly. But for now, we still need to handle a nil Metadata, which can be a valid case when the cluster consists of only one machine. See lines 123 to 140 in 458d282:
    // Metadata can be nil if the request was broadcasted to only one machine.
    if mc.Metadata == nil && len(machineContainers) > 1 {
        return svc, errors.New("something went wrong with gRPC proxy: metadata is missing for a machine response")
    }
    if mc.Metadata != nil && mc.Metadata.Error != "" {
        // TODO: return failed machines in the response.
        fmt.Printf("WARNING: failed to list containers on machine '%s': %s\n",
            mc.Metadata.Machine, mc.Metadata.Error)
        continue
    }
    machineID := ""
    if mc.Metadata == nil {
        // ListServiceContainers was proxied to only one machine.
        for _, v := range machineIDByManagementIP {
            machineID = v
            break
        }
Please test that it works with a one-machine cluster.
I see. I've added similar logic from service.go to the usage in ps.go now.
I tested it against a single-machine ucind cluster locally.
I also added some unit tests for collectContainers in ps.go to cover these various edge cases. They rely heavily on mocking the data returned to it, though, and I'm not sure how you feel about that testing approach. Let me know.
Perfect, thank you so much for updating the Metadata handling! 🙏
Regarding these mocked tests, I'm not a big fan of them, as it's hard to replicate all the intricacies of the dependent components and keep the mocks up to date with changes in those components. For example, I'm going to change how the gRPC router works, which will slightly change the Metadata values. Then we will need to update all the mocks to reflect the change. Hopefully LLMs will help do that correctly.
That's why I've been implementing mainly e2e tests so far, to test the real workflows. They are the slowest tests, though, and they've started to become a bit flaky recently. We will likely need to do something about that soon.
And it's fine to keep the tests you added, appreciate your effort!
> Regarding these mocked tests, I'm not a big fan of them as it's hard to replicate all the intricacies of the dependent components and keep them up to date with the changes in those components.

Absolutely agree.

> That's why I've been implementing mainly e2e tests so far to test the real workflows. Although, these are the slowest tests and they started to become a bit flaky recently. We will likely need to do something about it soon.

Yeah, these really should be e2e tests, but I recalled you mentioning not really wanting to add more cluster setups to the e2e tests either.
I looked at adding it to an existing e2e test, but those are oriented more around the functionality being tested than the cluster configuration needed, and it felt wrong to tack these onto, say, the exec e2e tests...
Maybe we just need to change how the e2e tests are structured/grouped: start with the "context", then the "functionality"? For example:
- in a one machine cluster
- `exec`
- does foo
- `ps`
- does bar
- in a two machine cluster
- `dns`
- ...
- with a suspect node
- `cmd`
- ...
Something like that to let us reuse the heavy, time-consuming cluster setup across as many different tests as possible. Once they are up, the individual test cases can be very quick.
> Yeah, these really should be e2e tests, but I recalled you mentioning not really wanting to add more cluster setups in the e2e tests, too.

That's a tradeoff between performance and good structure. The fewer clusters we create when running the entire test suite, the faster it will run. And we want it to be faster because that will let us run it more often when developing locally and catch bugs earlier. Also, I'm not too sure how close we are to reaching the memory limit on the free GitHub CI agents we're running it on now, as every cluster consumes a few hundred MB of memory.
So I was postponing any substantial work on the e2e tests until they break, and they still work; not perfectly, but they work 😄
> Maybe we just need to change how the e2e tests are structured/grouped. Start with "context" then "functionality"?

Yeah, that's a good option! I don't remember if we have any test suites that use different clusters or if it's always 1-to-1. It would be great if we don't need to split the tests for one logical feature into multiple different files/suites.
I remember that we have a few cases that can't be run in parallel (e.g. they depend on setting env vars), so this is something to keep in mind when restructuring the test suites.
I've just asked Claude if it's possible to have something like pytest fixtures (Python) in Go to inject a cluster as a dependency. It suggested an interesting approach with a lazy cluster pool. It would allow us to structure tests based on the feature/concept they test while also sharing clusters of the required configuration between them. We could even split the current long TestXXX suites into more scoped ones for easier management.
Plan: Shared Cluster Fixtures for E2E Tests
Goal
Reduce ucind cluster creation from 10 clusters to 2 shared clusters (one 3-machine, one 1-machine) while maintaining the ability to run individual tests in isolation when needed.
Recommended Approach: TestMain with Lazy Cluster Pools
Use Go's TestMain to manage shared cluster fixtures at the package level, with lazy initialization and fallback support for isolated test runs.
Architecture
test/e2e/
├── main_test.go # NEW: TestMain + cluster pool management
├── cluster_test.go # Modified: use shared cluster
├── service_test.go # Modified: use shared cluster
├── ...
Key Design Decisions
- Two shared cluster pools:
- shared3Machine - for tests requiring 3 machines (8 tests currently)
- shared1Machine - for tests requiring 1 machine (2 tests currently)
- Lazy initialization with sync.Once: Clusters are created on first request, not upfront. This means running a single test still works (see the sketch after this plan).
- Cleanup at package exit: TestMain handles cluster teardown after m.Run() completes.
- Fallback support: If TEST_CLUSTER_NAME env var is set, use that cluster (existing behavior preserved).
Trade-offs
Pros:
- Significant reduction in test setup time and resources
- Tests still runnable in isolation (lazy init creates cluster on demand)
- Backwards compatible with TEST_CLUSTER_NAME env var
Cons:
- Tests must be careful about cleanup (already mostly true)
- Test failures could potentially affect other tests sharing the cluster
- Slightly more complex test infrastructure
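A hypothetical sketch of that lazy cluster pool (the Cluster type and the create/destroy helpers are placeholders, not the project's actual ucind test API):

```go
package e2e

import (
	"os"
	"sync"
	"testing"
)

// Cluster is a placeholder for the real ucind cluster handle.
type Cluster struct{ Name string }

type clusterPool struct {
	machines int
	once     sync.Once
	cluster  *Cluster
	err      error
}

var (
	pools = map[int]*clusterPool{
		1: {machines: 1}, // shared1Machine
		3: {machines: 3}, // shared3Machine
	}
	cleanupMu sync.Mutex
	cleanups  []func()
)

// sharedCluster returns a cluster with the requested machine count, creating it at
// most once per `go test` run. TEST_CLUSTER_NAME keeps the existing fallback behaviour.
func sharedCluster(t *testing.T, machines int) *Cluster {
	t.Helper()
	if name := os.Getenv("TEST_CLUSTER_NAME"); name != "" {
		return &Cluster{Name: name}
	}
	p, ok := pools[machines]
	if !ok {
		t.Fatalf("no shared cluster pool for %d machines", machines)
	}
	p.once.Do(func() {
		p.cluster, p.err = createUcindCluster(machines) // placeholder for the real setup
		if p.err == nil {
			cleanupMu.Lock()
			cleanups = append(cleanups, func() { destroyUcindCluster(p.cluster) })
			cleanupMu.Unlock()
		}
	})
	if p.err != nil {
		t.Fatalf("create %d-machine cluster: %v", machines, p.err)
	}
	return p.cluster
}

// Placeholders standing in for the real ucind create/remove helpers.
func createUcindCluster(machines int) (*Cluster, error) { return &Cluster{}, nil }
func destroyUcindCluster(c *Cluster)                    {}

// TestMain runs the suite, then tears down any clusters that were actually created.
func TestMain(m *testing.M) {
	code := m.Run()
	for _, cleanup := range cleanups {
		cleanup()
	}
	os.Exit(code)
}
```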
… for service containers
… error immediately, in `uc ps`
Force-pushed from 9a7b410 to 86a8928.
Implementation for #56.