Poking around on dogfood, I noted blueprint execution was reporting success but taking several minutes:
task: "blueprint_executor"
configured period: every 1m
currently executing: iter 1003, triggered by a periodic timer firing
started at 2024-12-09T20:10:42.700Z, running for 2497ms
last completed activation: iter 1002, triggered by a periodic timer firing
started at 2024-12-09T20:06:21.942Z (263s ago) and ran for 260744ms
target blueprint: d08ce52e-7af3-4caa-9505-d695236c18b0
execution: enabled
status: completed (16 steps)
error: (none)
Looking at the status report for one execution, it's the three steps that send disks, datasets, and zones to each sled agent that are taking almost all of the time (steps 3-5):
root@oxz_switch0:~# omdb nexus background-tasks print-report blueprint_executor
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:104::3]:12221
[Dec 09 20:15:05] Running ( 1/16) Ensure external networking resources
[Dec 09 20:15:05] Completed ( 1/16) Ensure external networking resources: after 64.24ms
[Dec 09 20:15:05] Running ( 2/16) Fetch sled list
[Dec 09 20:15:05] Completed ( 2/16) Fetch sled list: after 17.27ms
[Dec 09 20:15:05] Running ( 3/16) Deploy physical disks
[Dec 09 20:16:18] Completed ( 3/16) Deploy physical disks: after 73.64s
[Dec 09 20:16:18] Running ( 4/16) Deploy datasets
[Dec 09 20:17:46] Completed ( 4/16) Deploy datasets: after 87.63s
[Dec 09 20:17:46] Running ( 5/16) Deploy Omicron zones
[Dec 09 20:19:15] Completed ( 5/16) Deploy Omicron zones: after 88.76s
[Dec 09 20:19:15] Running ( 6/16) Plumb service firewall rules
[Dec 09 20:19:15] Completed ( 6/16) Plumb service firewall rules: after 470.23ms
[Dec 09 20:19:15] Running ( 7/16) Ensure dataset records
[Dec 09 20:19:29] Completed ( 7/16) Ensure dataset records: after 13.77s
[Dec 09 20:19:29] Running ( 8/16) Deploy DNS records
[Dec 09 20:19:29] Completed ( 8/16) Deploy DNS records: after 210.87ms
[Dec 09 20:19:29] Running ( 9/16) Cleanup expunged zones
[Dec 09 20:19:29] Completed ( 9/16) Cleanup expunged zones: after 17.28µs
[Dec 09 20:19:29] Running (10/16) Decommission sleds
[Dec 09 20:19:29] Completed (10/16) Decommission sleds: after 5.01µs
[Dec 09 20:19:29] Running (11/16) Decommission expunged disks
[Dec 09 20:19:29] Completed (11/16) Decommission expunged disks: after 88.53ms
[Dec 09 20:19:29] Running (12/16) Deploy clickhouse cluster nodes
[Dec 09 20:19:29] Completed (12/16) Deploy clickhouse cluster nodes: after 4.17µs
[Dec 09 20:19:29] Running (13/16) Deploy single-node clickhouse cluster
[Dec 09 20:19:30] Completed (13/16) Deploy single-node clickhouse cluster: after 240.05ms
[Dec 09 20:19:30] Running (14/16) Reassign sagas
[Dec 09 20:19:30] Completed (14/16) Reassign sagas: after 37.94ms
[Dec 09 20:19:30] Running (15/16) Ensure CockroachDB settings
[Dec 09 20:19:30] Completed (15/16) Ensure CockroachDB settings: after 7.91ms
[Dec 09 20:19:30] Running (16/16) Finalize and check for errors
[Dec 09 20:19:30] Completed (16/16) Finalize and check for errors: after 5.96µs
Grepping out the start and stop for each PUT request for the "Deploy physical disks" step in one execution, we don't have one outlying sled; rather, every sled takes a handful of seconds (which is surprising!), and we don't currently parallelize this step:
2024-12-09 20:02:01.632Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:03.956Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:03.994Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:10.293Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:10.479Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:16.788Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:16.828Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:24.219Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:24.256Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:31.144Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:31.183Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:40.839Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:41.015Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:48.313Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:48.353Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:02:57.234Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:02:57.273Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:03:01.811Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:03:01.850Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:03:08.276Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:03:08.313Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:03:15.376Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
2024-12-09 20:03:15.415Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client request
2024-12-09 20:03:24.249Z DEBG 20b100d0-84c3-4119-aa9b-0c632b0b6a3a/13246 (ServerContext) on oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a: client response
Poking around on dogfood, I noted blueprint execution was reporting success but taking several minutes:
Looking at the status report for one execution, it's the three steps that send disks, datasets, and zones to each sled agent that are taking almost all of the time (steps 3-5):
Grepping out the start and stop for each
PUTrequest for the "Deploy physical disks" step in one execution, we don't have one outlying sled; rather, every sled takes a handful of seconds (which is surprising!), and we don't currently parallelize this step: