Skip to content

roachprod: azure GetVMSpecs() implementation#154467

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
williamchoe3:wchoe/146202-roachprod-azure-ibm-vm-specs
Nov 6, 2025
Merged

roachprod: azure GetVMSpecs() implementation#154467
craig[bot] merged 1 commit intocockroachdb:masterfrom
williamchoe3:wchoe/146202-roachprod-azure-ibm-vm-specs

Conversation

@williamchoe3
Copy link
Copy Markdown
Contributor

@williamchoe3 williamchoe3 commented Sep 30, 2025

Informs #146202

Implementation for previously unimplemented interface method GetVMSpecs() for Azure. This will allow for cluster information to be fetched in roachtest via FetchVMSpecs

Also changed azureIDPattern to match the optional type name pairs that may exist in a fully qualified azure resource id. See comments for more details. Previously, if these optional type name pairs existed, they were just a part of the resourceName field. Currently, I didn't see any of these child resource ids being parsed, but if we were to ever parse those in the future the previous implementation wouldn't make sense.

  • Note: Although the regex selects child resource type and names in the last matching group, they are not saved to AzureId struct because they are not being used

For Azure, client.Get(...) allows you to pass in an optional expander parameter which seems to provide additional information for "instanceView" and runs an optionally passed in user script (passed in on vm creation) when "userData" is passed in.

InstanceViewTypesInstanceView InstanceViewTypes = "instanceView"  
InstanceViewTypesUserData InstanceViewTypes = "userData"

IBM PR: #155368

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@williamchoe3
Copy link
Copy Markdown
Contributor Author

williamchoe3 commented Sep 30, 2025

willchoe-1759266579-01-n1cpu4-0001.json

{
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {
      "/subscriptions/73075e4f-150d-4f4f-956b-053507b61fa7/resourceGroups/rp-roachtest/providers/Microsoft.ManagedIdentity/userAssignedIdentities/rp-roachtest": {}
    }
  },
  "location": "eastus",
  "properties": {
    "hardwareProfile": {
      "vmSize": "Standard_D4ds_v5"
    },
    "networkProfile": {
      "networkInterfaces": [
        {
          "id": "/subscriptions/73075e4f-150d-4f4f-956b-053507b61fa7/resourceGroups/willchoe-1759266579-01-n1cpu4-eastus/providers/Microsoft.Network/networkInterfaces/willchoe-1759266579-01-n1cpu4-0001",
          "properties": {
            "primary": true
          }
        }
      ]
    },
    "osProfile": {
      "adminUsername": "ubuntu",
      "allowExtensionOperations": true,
      "computerName": "willchoe-1759266579-01-n1cpu4-0001",
      "linuxConfiguration": {
        "disablePasswordAuthentication": true,
        "patchSettings": {
          "assessmentMode": "ImageDefault",
          "patchMode": "ImageDefault"
        },
        "provisionVMAgent": true,
        "ssh": {
          "publicKeys": [
            {
              "keyData": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINPfW7BbqHu6hb0kQoGFXM1r8ZawkBOyp2a3xkMHf11J will.choe@cockroachlabs.com\n",
              "path": "/home/ubuntu/.ssh/authorized_keys"
            }
          ]
        }
      },
      "requireGuestProvisionSignal": true,
      "secrets": []
    },
    "storageProfile": {
      "dataDisks": [],
      "diskControllerType": "SCSI",
      "imageReference": {
        "offer": "0001-com-ubuntu-server-jammy",
        "publisher": "Canonical",
        "sku": "22_04-lts-gen2",
        "version": "22.04.202312060"
      },
      "osDisk": {
        "caching": "ReadWrite",
        "createOption": "FromImage",
        "deleteOption": "Detach",
        "diskSizeGB": 32,
        "managedDisk": {
          "id": "/subscriptions/73075e4f-150d-4f4f-956b-053507b61fa7/resourceGroups/willchoe-1759266579-01-n1cpu4-eastus/providers/Microsoft.Compute/disks/willchoe-1759266579-01-n1cpu4-0001_disk1_ca3e8d8fee804935a3eb41a3b95f837f",
          "storageAccountType": "StandardSSD_LRS"
        },
        "name": "willchoe-1759266579-01-n1cpu4-0001_disk1_ca3e8d8fee804935a3eb41a3b95f837f",
        "osType": "Linux"
      }
    }
  },
  "tags": {
    "arch": "amd64",
    "cluster": "willchoe-1759266579-01-n1cpu4",
    "created": "2025-09-30T21:09:48Z",
    "lifetime": "12h0m0s",
    "roachprod": "true",
    "test_name": "roachtest-manual-success",
    "test_owner": "test-eng",
    "test_run_id": "will-choe-1759266579",
    "usage": "roachtest"
  },
  "zones": [
    "1"
  ]
}

@williamchoe3 williamchoe3 force-pushed the wchoe/146202-roachprod-azure-ibm-vm-specs branch from da5b3e4 to 6b3dedb Compare September 30, 2025 21:49
@williamchoe3 williamchoe3 marked this pull request as ready for review September 30, 2025 21:49
@williamchoe3 williamchoe3 requested a review from a team as a code owner September 30, 2025 21:49
@williamchoe3 williamchoe3 requested review from herkolategan and shailendra-patel and removed request for a team September 30, 2025 21:49
@williamchoe3
Copy link
Copy Markdown
Contributor Author

err i have no idea why i'm getting diffs in pkg/cmd/roachtest/test_runner.go
just using the file from git checkout upstream/master -- pkg/cmd/roachtest/test_runner.go 🤔

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets change it to informs instead of resolves, since the issue is still outstanding for ibm

})

// Manual test for verifying framework behavior in a test success scenario
r.Add(registry.TestSpec{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's good that you're testing this e2e but it seems generic enough you don't really need to commit it. Running any roachtest with --always-collect-artifacts on azure would do the trick as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted an easy / basically "no op" test to use when running things E2E. I was using one of the http validation tests before but thought this would make my workflow simpler. I'd like to leave this in because I feel like i'd resuse this again, but if you feel strongly about it i can take it out. Also if there's a way to run the framework in a "test no op" mode, that works for me too

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you feel strongly about it i can take it out

Just a nit, I don't feel that strongly about it if you think it improves your workflow.

Also if there's a way to run the framework in a "test no op" mode

I usually just run one of the acceptance tests. Something like acceptance/build-info takes about 5 seconds longer to run than a noop test and goes through spinning up a CRDB cluster; doesn't really matter for your change here but it might catch a bug in other changes.

// disableIssue disables posting GitHub issues for test failures.
disableIssue bool
// dryRunIssuePosting enables dry-run mode for GitHub issue posting.
dryRunIssuePosting bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a rebase off of master?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes ty!

l *logger.Logger, vms vm.List,
) (map[string]map[string]interface{}, error) {
return nil, nil
// Boiler Plate
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: kind of a redundant comment, if you see an opportunity to reduce the boiler plate then go for it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o oops that was also from my notes, i'll take out the comment forgot to clean it up, as for the boiler plate, it's 2 function calls to setup the client, i think it's fine it just looked verbose to me at first but i think it's just go being go

azureVmId, err := parseAzureID(vmInstance.ProviderID)
if err != nil {
return nil, err
} else if azureVmId.resourceGroup == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on other usages of parseAzureID it seems safe to assume this won't be empty? If not then it makes more sense to move the empty string checking inside parseAzureID itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did some research, those fields are all required for a valid Azure Resource ID so yea it makes sense to do the validation in the struct constructor

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking the regex:

/subscriptions/([^/]+)/resourceGroups/([^/]+)/providers/([^/]+)/([^/]+)/([^/]+)(?:/(.*))?)`

And it seems like all the fields are guaranteed to be non-empty (assuming FindStringSubmatch doesn't return nil) except for the final child resource group. So we just need to check if that one is empty or not.

That being said, I'm not really sure what the child resource type does? Do we need it? Or can we just leave a comment explaining why we drop it?

Copy link
Copy Markdown
Contributor Author

@williamchoe3 williamchoe3 Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw that code is outdated, I took out the AzureID conditional checks from GetVMSpecs and all the error handling regarding the AzureID happens in parseAzureID now

And it seems like all the fields are guaranteed to be non-empty ... except for the final child resource group

Yes, that is my understanding too

So we just need to check if that one is empty or not

Check the final child resource group / the last matching group of the regex? Not currently doing a check if it's present, but just saving it to azureID.childResourceTypeNamePairs, and assuming for now a future caller can check if it's present or not

type azureID struct {
	// childResourceTypeNamePairs does not have its name pairs parsed into
	// struct fields because if present, the number of pairs is ambiguous
	childResourceTypeNamePairs string
	provider                   string
	resourceGroup              string
	resourceName               string
	resourceType               string
	subscription               string
}

If found, the child resource type name pairs will be in the childResourceTypeNamePairs field, but currently I'm not using it in GetVMSpecs nor is it used anywhere else afaik. I only added this for "correctness" because previously the child resource would've just been appended to the resourceName field i.e.
resourceName = lb01/frontendIPConfigurations/ipconfig1
for AzureID /subscriptions/1111/resourceGroups/rg1/providers/Microsoft.Network/loadBalancers/lb01/frontendIPConfigurations/ipconfig1 for the old regex
\/subscriptions\/(.+)\/resourceGroups\/(.+)\/providers\/(.+?)\/(.+?)\/(.+) which isn't correct

why we drop it

not currently dropped, but could also do that too, just did a bit of future proofing incase we need it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed redundant validation check in parseAzureID and removed childResourceTypeNamePairs field from azureID struct because not currently used

added comments to explain

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manually verified provisioning / GetVmSpecs still works as expected

@williamchoe3 williamchoe3 force-pushed the wchoe/146202-roachprod-azure-ibm-vm-specs branch 2 times, most recently from 132b246 to b11bb0d Compare September 30, 2025 22:12
azureVmId, err := parseAzureID(vmInstance.ProviderID)
if err != nil {
return nil, err
} else if azureVmId.resourceGroup == "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else ifs are redundant since control exits in each case. Humans prefer straight-line code.

return nil, errors.Errorf("resource name not found in azure id: %s", azureVmId.String())
}

// expand with InstanceViewTypesInstanceView and InstanceViewTypesUserData
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misplaced comment? The line immediately below it just creates a client.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops those were my notes for code that no longer exists

@williamchoe3
Copy link
Copy Markdown
Contributor Author

after looking into Azure resource IDs more while thinking about the error handling, noticed that the parsing and struct wouldn't work with child resource ids, so made the pattern more flexible and adjusted the struct. see description and code comments for more details

@williamchoe3 williamchoe3 marked this pull request as draft October 3, 2025 16:04
@williamchoe3 williamchoe3 marked this pull request as ready for review October 7, 2025 14:26
})

// Manual test for verifying framework behavior in a test success scenario
r.Add(registry.TestSpec{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you feel strongly about it i can take it out

Just a nit, I don't feel that strongly about it if you think it improves your workflow.

Also if there's a way to run the framework in a "test no op" mode

I usually just run one of the acceptance tests. Something like acceptance/build-info takes about 5 seconds longer to run than a noop test and goes through spinning up a CRDB cluster; doesn't really matter for your change here but it might catch a bug in other changes.

azureVmId, err := parseAzureID(vmInstance.ProviderID)
if err != nil {
return nil, err
} else if azureVmId.resourceGroup == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking the regex:

/subscriptions/([^/]+)/resourceGroups/([^/]+)/providers/([^/]+)/([^/]+)/([^/]+)(?:/(.*))?)`

And it seems like all the fields are guaranteed to be non-empty (assuming FindStringSubmatch doesn't return nil) except for the final child resource group. So we just need to check if that one is empty or not.

That being said, I'm not really sure what the child resource type does? Do we need it? Or can we just leave a comment explaining why we drop it?

name string
input string
expectErr bool
want azureID
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: expected and actual is more idiomatic for our codebase

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't hurt to kick off a SELECT_PROBABILITY=0.1 smoke test on TC since the regex pattern changed, but otherwise lgtm.

@williamchoe3 williamchoe3 force-pushed the wchoe/146202-roachprod-azure-ibm-vm-specs branch from 25d5dc4 to bff1d3c Compare October 27, 2025 14:48
@williamchoe3
Copy link
Copy Markdown
Contributor Author

@williamchoe3
Copy link
Copy Markdown
Contributor Author

williamchoe3 commented Nov 4, 2025

Smoke Test Failure

[00:55:31]W:	 [Step 2/2] main.(*testImpl).Fatal(0xc1a265c608?, {0xc08cde2880?, 0xc002695720?, 0xc6e5b90540?})
[00:55:31]W:	 [Step 2/2] 	pkg/cmd/roachtest/test_impl.go:427 +0x49
[00:55:31]W:	 [Step 2/2] main.(*clusterImpl).Conn(0xc1a265c608, {0xae839a8?, 0xc002695720?}, 0x0?, 0xc00237d860?, {0x0?, 0x0?, 0x0?})
[00:55:31]W:	 [Step 2/2] 	pkg/cmd/roachtest/cluster.go:2945 +0xac
[00:55:31]W:	 [Step 2/2] main.(*testRunner).postTestAssertions.func2.4(0xc0062b0b60, 0xc1a265c608, {0xae839a8, 0xc002695720}, 0x4, 0x8bb2c97000, 0xc6f3a7c480)
[00:55:31]W:	 [Step 2/2] 	pkg/cmd/roachtest/test_runner.go:1782 +0x10a
[00:55:31]W:	 [Step 2/2] main.(*testRunner).postTestAssertions.func2({0xae839a8, 0xc002695720})
[00:55:31]W:	 [Step 2/2] 	pkg/cmd/roachtest/test_runner.go:1794 +0x658
[00:55:31]W:	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func1({0xae839a8, 0xc002695720})
[00:55:31]W:	 [Step 2/2] 	pkg/util/stop/stopper.go:440 +0x8a
[00:55:31]W:	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 116
[00:55:31]W:	 [Step 2/2] 	pkg/util/stop/stopper.go:438 +0x137

Looks like postTestAssertions's RunAsyncTaskEx fataled when creating a SQL connection pkg/cmd/roachtest/cluster.go:2945 +0xac causing the entire roachtest invocation to panic

If my understanding is correct, this doesn't seem to be great, but it looks unrelated to GetVMSpecs because postTestAssertions is called before teardownTest which is where GetVMSpecs eventually get's called

			if err := r.postTestAssertions(ctx, t, c, 10*time.Minute); err != nil {
				l.Printf("error during post test assertions: %v; see test-post-assertions.log for details", err)
			}
...
	replaceLogger("test-teardown")
	if err := r.teardownTest(ctx, t, c, timedOut); err != nil {
		l.PrintfCtx(ctx, "error during test teardown: %v; see test-teardown.log for details", err)
	}
	if err := r.insp

Will re-run the smoke
https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyAzureBazel/20694753?hideTestsFromDependencies=false&hideProblemsFromDependencies=false

@DarrylWong
Copy link
Copy Markdown
Contributor

The unrecovered panic should be handled by #155981 which this PR doesn't include.

@williamchoe3
Copy link
Copy Markdown
Contributor Author

Ah thanks, will rebase

Implementation for previously unimplemented interface method `GetVMSpecs()` for
Azure. This will allow for cluster information to be fetched in roachtest via
`FetchVMSpecs`

Informs cockroachdb#146202

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/146202-roachprod-azure-ibm-vm-specs branch from bff1d3c to 67e3bed Compare November 4, 2025 21:21
@williamchoe3
Copy link
Copy Markdown
Contributor Author

@williamchoe3
Copy link
Copy Markdown
Contributor Author

tftr!

bors r=DarrylWong

craig bot pushed a commit that referenced this pull request Nov 6, 2025
154467: roachprod: azure GetVMSpecs() implementation r=DarrylWong a=williamchoe3

Informs #146202 

Implementation for previously unimplemented interface method `GetVMSpecs()` for Azure. This will allow for cluster information to be fetched in roachtest via `FetchVMSpecs`

Also changed `azureIDPattern` to match the optional type name pairs that may exist in a fully qualified azure resource id. See comments for more details.  Previously, if these optional type name pairs existed, they were just a part of the `resourceName` field. Currently, I didn't see any of these child resource ids being parsed, but if we were to ever parse those in the future the previous implementation wouldn't make sense.
* Note: Although the regex selects child resource type and names in the last matching group, they are not saved to AzureId struct because they are not being used

For Azure, `client.Get(...)` allows you to pass in an optional `expander` parameter which seems to provide additional information for `"instanceView"` and runs an optionally passed in user script (passed in on vm creation) when `"userData"` is passed in.
*  `"userData"` doesn't seem relevant for us "retrieves the UserData property as part of the VM model view that was provided by the user during the VM Create/Update operation" https://learn.microsoft.com/en-us/rest/api/compute/virtual-machines/get?view=rest-compute-2025-04-01&tabs=HTTP
  *  "The `instanceView` of an Azure Virtual Machine (VM) provides a snapshot of the runtime properties and status of the VM" https://learn.microsoft.com/en-us/rest/api/compute/virtual-machines/instance-view?view=rest-compute-2025-04-01&tabs=HTTP Seems nice, but there's no diff in the json returned by `"instanceView"` and the default when `expander` is not set (referred to `model view` in some places on thier docs) 
```go
InstanceViewTypesInstanceView InstanceViewTypes = "instanceView"  
InstanceViewTypesUserData InstanceViewTypes = "userData"
```


IBM PR: #155368


155368: roachprod: implement GetVMSpecs for IBM r=DarrylWong,golgeek a=williamchoe3

Informs #146202

Implementation for previously unimplemented interface method GetVMSpecs() for IBM. This will allow for cluster information to be fetched in roachtest via FetchVMSpecs
* API: https://cloud.ibm.com/apidocs/resource-controller/resource-controller?code=go#get-resource-instance 

Also added comments to help contextualize some parts that I thought would be helpful while doing some debug.

See comment for e.g. vm specs


156504: sql: enable test tenants r=yuzefovich a=yuzefovich

**sql: remove some redundant kv.TestingIsRangeLookupRequest calls in tests**

These were redundant because we have GetRequests in all call sites.

**sql: enable test tenants**

Some tests have been adjusted to work with test tenants, others have specific issues to investigate further.

Fixes: #143114
Epic: CRDB-48945

156547: sql: add ColumnGeneratedAsIdentity as a new schema change element r=shghasemi a=shghasemi

Previously, GeneratedAsIdentity was part of the Column element.
Creating a new schema change element will allow the declarative schema changer
to add/remove/update such columns without dropping the Column element.

Epic: [CRDB-31283](https://cockroachlabs.atlassian.net/browse/CRDB-31283)

Part of: #142918

Release note: None

156940: util/taskset: add generic work distribution mechanism r=spilchen a=spilchen

Adds a new taskset package that provides task coordination for a number of workers. TaskSet hands out integer identifiers (TaskIDs) that workers claim and process, with the caller responsible for mapping TaskIDs to actual work (file indices, key ranges, batches, etc.).

Features round-robin initial distribution across workers and locality when getting a task ID. Not thread-safe; callers provide external synchronization.

This will be used by the new distributed merge pipeline.

Closes #156578

Epic: CRDB-48845
Release note: none

Co-authored by: `@jeffswenson`

156987: sql: remove now-stale growstack for internal executor r=yuzefovich a=yuzefovich

As of 4a5c503, we now grow stack unconditionally for all async tasks. Some time ago we added an ability to growstack in the async task of the internal executor when used in the LDR, and now that has become redundant, so we remove this special case.

Epic: None
Release note: None

Co-authored-by: William Choe <williamchoe3@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Shadi Ghasemitaheri <shadi.ghasemitaheri@cockroachlabs.com>
Co-authored-by: Matt Spilchen <matt.spilchen@cockroachlabs.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 6, 2025

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 6, 2025

@craig craig bot merged commit 353e4b9 into cockroachdb:master Nov 6, 2025
33 of 34 checks passed
rishabh7m added a commit to crltest-issue-syncing/cockroach-fork that referenced this pull request Nov 20, 2025
rishabh7m added a commit to crltest-issue-syncing/cockroach-fork that referenced this pull request Nov 21, 2025
rishabh7m added a commit to crltest-issue-syncing/cockroach-fork that referenced this pull request Nov 21, 2025
rishabh7m added a commit to crltest-issue-syncing/cockroach-fork that referenced this pull request Dec 1, 2025
rishabh7m added a commit to crltest-issue-syncing/cockroach-fork that referenced this pull request Dec 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants