roachtest: extend github issue poster to include ip node mapping#153962
Conversation
aca08a3 to
a1c5008
Compare
@srosenberg from your comment here #153955 (comment) |
There was a problem hiding this comment.
I'd also (selfishly) like it to save the output of this mapping as a log. My general flow for investigating roachtests is that github info is just for cursory glances but once I start doing a deep dive (e.g. need to start matching ips to things in cockroach.log), I generally just stick to TeamCity and wouldn't want to go back to github to find the mapping.
| Cluster Node to Ip Mapping: | ||
|
|
||
| ``` | ||
| Name DNS Private IP Public IP Machine Type CPU Arch CPU Family Provision Model |
There was a problem hiding this comment.
Suggestion: can we put nodeID, Private IP, Public IP?, and Machine Type as the first four fields? i.e. so I can see what I care about 99% of the time without scrolling?
Arch is already listed in the github issue (and shouldn't differ across machines in a test). The average engineer debugging a roachtest probably doesn't care about CPU Family or Provision Model either.
There was a problem hiding this comment.
Oh, just saw your other comment about Stan mentioning something similar. My argument against including all the fields is that this is just a convenience mapping so we don't want to overload it with information that normally isn't pertinent to investigation. Remember that we already gather vm_specs which has already has all of this information already.
There was a problem hiding this comment.
We definitely don't need the DNS column, and the Name column can be replaced by the Node column; nodes are always canonically named, i.e., n1, n2, etc. Thus, the most important are the three columns as illustrated in [1].
There was a problem hiding this comment.
redid my approach, no more log parsing. reference the new description for the new table. I'm open to changing the node names btw, there's a bullet in the description with my thoughts on it
pkg/cmd/roachtest/test_runner.go
Outdated
| } | ||
|
|
||
| // gatherNodeIpMapping attempts to gather cluster node to ip map from | ||
| // _runner-logs/cluster-create |
There was a problem hiding this comment.
The VM information should already be available through clusterImpl -> getCachedCluster() -> cloud.Cluster -> VMs. Was there a reason to opt for parsing the logs here which might break if we change the output of the logs?
There was a problem hiding this comment.
yea no good reason 😅 was mainly reading Cluster Interface's methods and the test_runner.go's cluster creation code path so didn't realize there was all this built out utility already in clusterImpl, redoing my approach will undraft the PR when done
|
I put out #154164 so hopefully 🤞 you should be able to test your changes without creating actual github issues. |
| var reproRE = regexp.MustCompile(`(?s)( *rsg_test.go:\d{3}: To reproduce, use schema:)`) | ||
|
|
||
| var roachtestNodeFatalRE = regexp.MustCompile(`(?ms)\A(.*?\n)((?:^F\d{6}\b[^\n]*(?:\n|$))+)`) | ||
| var roachtestNodeToIpRE = regexp.MustCompile(`(?m)^[[:space:]]*\|(?:[^|\n]*\|)+[[:space:]]*\n^[[:space:]]*\|(?:[[:space:]]*:?-{3,}:?[[:space:]]*\|)+[[:space:]]*\n(?:^[[:space:]]*\|(?:[^|\n]*\|)+[[:space:]]*(?:\n|$))+`) |
There was a problem hiding this comment.
Nit: this regex is an eye sore. A comment with an example would be helpful.
There was a problem hiding this comment.
// nodeToIpRoachtestRE matches an entire Markdown table block
// - including header, separator, and data rows:
// - handles an arbitrary number of columns
//
// | Node | Private IP | Public IP | Machine Type |
// | --- | --- | --- | --- |
// | node-0001 | 1.1.1.0 | 1.1.1.1 | n2-standard-4 |
var nodeToIpRoachtestRE = regexp.MustCompile(...)
| for _, row := range data[1:] { | ||
| if len(data[0]) != len(row) { | ||
| return "", errors.Errorf("row %d has %d columns, expected %d", | ||
| len(data)-1, len(row), len(data[0])) |
There was a problem hiding this comment.
len(data)-1 is not the ith row.
| out, err := ToMarkdownTable(tt.input) | ||
| if err != nil { | ||
| if tt.shouldSucceed { | ||
| t.Errorf("ToMarkdownTable() failed when expected to succeed: error = %v", err) |
There was a problem hiding this comment.
Might as well check for the expected err since the above would also output the last row index in case of (column number) mismatch.
There was a problem hiding this comment.
added dedicated error vars and checking in the type in the unit test
pkg/cmd/roachtest/test_runner.go
Outdated
|
|
||
| output := fmt.Sprintf("%s\ntest artifacts and logs in: %s", failureMsg, t.ArtifactsDir()) | ||
| params := getTestParameters(t, issueInfo.cluster, issueInfo.vmCreateOpts) | ||
| // githubMsg may contain additional info not needed in output |
There was a problem hiding this comment.
Nit: the "not needed" modifier is confusing; the two outputs are mutually disjoint. Perhaps renaming output to stdoutMsg would suffice to disambiguate?
There was a problem hiding this comment.
// Construct stdoutMsg which will be shouted and githubMsg which will
// be passed to github.MaybePost to be formatted in the github issue
// body
Renamed some vars to make what's going on more clear, so now failureMsg will be the base for both stdoutMsg & githubMsg
pkg/cmd/roachtest/test_runner.go
Outdated
| t.L().Printf("Attempting to gather ip node mapping") | ||
| ipNodeMapOut, err := gatherNodeIpMapping(t, c) | ||
| if err != nil { | ||
| if joinErr := errors.Join(inspectArtifactsErr, err); joinErr != nil { |
There was a problem hiding this comment.
err != nil implies that errors.Join(inspectArtifactsErr, err) != nil. The nested if is redundant.
There was a problem hiding this comment.
makes sense, removed the redundant check
471fd68 to
ae22cce
Compare
pkg/cmd/roachtest/github_test.go
Outdated
| test artifacts and logs in: artifacts/roachtest/manual/monitor/test-failure/node-fatal-explicit-monitor/cpu_arch=arm64/run_1 | ||
| F250826 19:49:07.194443 3106 sql/sem/builtins/builtins.go:6063 ⋮ [T1,Vsystem,n1,client=127.0.0.1:54552,hostssl,user=‹roachprod›] 250 force_log_fatal(): ‹oops› | ||
| `) | ||
| case "error-with-ip-node-info": |
There was a problem hiding this comment.
It seems a bit off to me that we are encoding these as errors. Shouldn't this be just be an additional string appended to the message? That way we can test how ip node info interacts with other failures, e.g. node-fatal, which also seems like it shouldn't be an error?
There was a problem hiding this comment.
It seems a bit off to me that we are encoding these as errors.
Yea, I agree it's all a bit convoluted, and not just in the unit test logic. I think it's because the message parameter is conflated as an error message and not as just a generic message.
So before my changes, github.MaybePost and subsequently createPostRequest messsage parameter was just the error message. So when I added the node-fatal, it sort of made sense that it was it's own error so it's additional information i.e. the node logs could just be appended with the error. But with the ip node logic, that thinking sort of breaks down because it's additional information that we want to just append to every type of message or error message.
I think a better approach to all of this would be to change the issues.PostRequest struct to take in an additional []string field in addition to message, something like info that would contain the fatal node logs, and the ip node info, etc. Then in unitTestFormatterTyp.Body, could parse each element in info. Basically, it doesn't seem like issues was originally written to pass in "additional info", only the error.
As for the unit test logic, the error message message is just being formed from the errors i.e. refError that's passed in the first step.
// See: `formatFailure` which formats failures for roachtests. Try to
// follow it here.
var b strings.Builder
for i, f := range testCase.failures {
if i > 0 {
fmt.Fprintln(&b)
}
// N.B. Don't use %+v here even though roachtest does. We don't
// want the stack trace to be outputted which will differ based
// on where this test is run and prone to flaking.
fmt.Fprintf(&b, "%v", f.squashedErr)
}
message := b.String()Then we're calling github.createPostRequest with that message. This doesn't really match what test_runner.go is doing anymore because I added the string appending logic for ipnode info to append to message before github.createPostRequest is called.
I was initially thinking that if we're just testing the format, then passing in a refError with what I wanted to format would suffice.
Instead I can add a step to pass along the ip node info so would just be something like message.append(additionalInfo) and remove the case "error-with-ip-node-info"
Long comment but just wanted to also address why I think a lot of this feels a bit "hacky" from the random string appends to the reliance on long confusing regex strings in issues
There was a problem hiding this comment.
tl;dr:
Instead I can add a step to pass along the ip node info so would just be something like message.append(additionalInfo) and remove the case "error-with-ip-node-info"
There was a problem hiding this comment.
I agree, we should refactor this because it gets even messier; i.e., more cases (and obscure regexes). Note that we already have the ExtraParams map in the PostRequest, but it serves a different purpose. Maybe another map, e.g., Diagnostics map[string]string would do the trick?
Let's write up an issue for it and address it in a subsequent PR.
There was a problem hiding this comment.
Wrote this up, mentioned it in a comment in the code in a TODO, kind of just mind dumped, if I end up doing it I have enough context, but mainly made it verbose for others / if someone else ends up doing it
| output := fmt.Sprintf("%s\ntest artifacts and logs in: %s", failureMsg, t.ArtifactsDir()) | ||
| params := getTestParameters(t, issueInfo.cluster, issueInfo.vmCreateOpts) | ||
| githubMsg := output | ||
| if testGithubMsg := t.getGithubMessage(); testGithubMsg != "" { |
There was a problem hiding this comment.
nit: why can't t.getGithubMessage just call both getGithubIpToNodeMapping and getGithubFatalLogs then do the message construction? Seems cleaner to have the test runner not be aware of the github message details.
pkg/cmd/roachtest/test_runner.go
Outdated
| // without a regex change | ||
| table = append(table, []string{"Node", "Private IP", "Public IP", "Machine Type"}) | ||
| for _, vmInstance := range cachedCluster.VMs { | ||
| table = append(table, []string{vmInstance.Name, vmInstance.PrivateIP, vmInstance.PublicIP, vmInstance.MachineType}) |
There was a problem hiding this comment.
In hindsight, we probably don't need the machine type since its static, i.e. I can just look at the test to figure it out
There was a problem hiding this comment.
Agreed, machine type is easily inferred and not really relevant in majority of (triage) cases. Let's remove.
pkg/cmd/roachtest/test_runner.go
Outdated
| testClusterLogger, err := c.l.ChildLogger("node-ips", logger.QuietStderr, logger.QuietStdout) | ||
| if err != nil { | ||
| t.L().Printf("unable to create logger %s: %s", "node-ips", err) | ||
| return "", err |
There was a problem hiding this comment.
Should be fine to swallow the error here and continue, we will at least get the mapping in the github issue if one exists
pkg/cmd/roachtest/test_runner.go
Outdated
| // Works with local and remote node clusters because we will always download | ||
| // the artifacts if there's a test failure (except for timeout) | ||
| cmd := exec.Command("grep", args...) | ||
| cmd := exec.Command(command, args...) |
There was a problem hiding this comment.
nit: change seems redundant since command is never modified, makes it marginally harder to follow
| filePattern := "logs/*unredacted/cockroach*.log" | ||
| // *unredacted captures patterns for single node and multi-node clusters | ||
| // e.g. unredacted, 1.unredacted | ||
| // * wildcard to capture patterns for single node and multi-node clusters |
There was a problem hiding this comment.
nit: not a big deal but some of these fatal node log fixes could've been their own commit
|
Running E2E (w/ dry run) bc test_runner.go was just changed, will also kick off a smoke test ci run |
1b28583 to
e715511
Compare
pkg/cmd/roachtest/github_test.go
Outdated
| case "ip-node-info": | ||
| testCase.message = fmt.Sprintf("%s\n%s", testCase.message, `| Node | Private IP | Public IP | Machine Type | | ||
| | --- | --- | --- | --- | | ||
| | teamcity-1758834520-01-n1cpu4-0001 | 10.142.0.2 | 34.139.44.53 | n2-standard-4 |`) |
There was a problem hiding this comment.
One tiny nit... Could we swap the positions of public and private ips? In most infra flakes I can think of, a public rather than private ip is implicated. So, I worry about the case where the cluster name is very long and the public ip is obscured by the scrollbar; i.e., this optimizes for (clusterName, publicIp) to be always visible, even on tiny screens :)
| r.CodeBlock("", fnr.Message) | ||
| r.Escaped("Fatal entries found in Cockroach logs:") | ||
| r.CodeBlock("", fnr.FatalLogs) | ||
| if nodeIpMap, ok := data.CondensedMessage.NodeToIpMappingRoachtest(); ok { |
There was a problem hiding this comment.
You're inside the case for FatalNodeRoachtest. How can the message be that and also NodeToIpMappingRoachtest?
There was a problem hiding this comment.
Yea so this is another consequence of doing the message parsing all in one go instead of more modular (see long comment #153962 (comment))
This is the case where we both have fatal node logs and ip node info. I went ahead with this hacky way because i wanted to keep my changes to the if else if else control chain as minimal as possible. I would have liked if all these conditionals where just if clauses instead of else if clauses, not sure if the other formatters potentially evaluate to true multiple times
I thought I had examples of these somewhere in this PR but can't find the comment but this is the corresponding datadriven test for this case https://github.com/williamchoe3/cockroach/blob/e7155110ece307284e073a7e0d3c871d830411de/pkg/cmd/roachtest/testdata/github/node_fatal_with_ip_node_info
There was a problem hiding this comment.
I see it's effectively demultiplexing. But then we have a bug in getGithubMessage which overwrites instead of appends, in the case of fatal logs?
There was a problem hiding this comment.
githubMsg := failureMsg
if githubFatalLogs := t.getGithubFatalLogs(); githubFatalLogs != "" {
githubMsg = fmt.Sprintf("%s\n%s", githubMsg, githubFatalLogs)
}
if githubIpToNodeMapping := t.getGithubIpToNodeMapping(); githubIpToNodeMapping != "" {
githubMsg = fmt.Sprintf("%s\n%s", githubMsg, githubIpToNodeMapping)
}
I'm doing a weird append, I think i did it this way because I needed the\n prepended to githubMsg, but also a \n char inbetween the two info types if they both existed
There was a problem hiding this comment.
this logic looked better to me before, but on 2nd pass it does look confusing, I can just use append if that makes things clearer
githubMsg = append(githubMsg + '\n' + githubIpToNodeMapping)
...
githubMsg = append('\n', githubMsg)
There was a problem hiding this comment.
Oh, 🤦. Ignore me, somehow I missed githubMsg in the RHS (of the assignment).
Quality of life improvement, previously to find this information, you would need to find the name of the cluster in test.log, then use that cluster name to find it's cluster creation log in artifacts_dir/_runner-logs/cluster-create. Now this information will be available in the github issue. Resolves cockroachdb#138356 Epic: None Release note: None
1528386 to
0eb1ca1
Compare
|
Another Smoke Test after addressing comments + merge conflicts https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/20707055?hideTestsFromDependencies=false&hideProblemsFromDependencies=false |
|
tftr! bors r=srosenberg,DarrylWong |
|
Build succeeded: |

Resolves #138356
Motivation
Quality of life improvement, previously to find this information, you would need to find the name of the cluster in
test.log, then use that cluster name to find it's cluster creation log in{artifacts_dir}/_runner-logs/cluster-create.Approach
Get's node ip info from roachprod's
cloud.ClusterSimilar to #151850In the test cleanup phase ininspectArtifactsadded a new functiongatherNodeIpMappingwhich is best effort to gather the cluster information in the log.First we need to find the right log file given that there could be retries for cluster creation which creates additional log files. Given the naming convention, I opted to the just sort the log files that contained the cluster name. I realized a bit later though that because lexicographical sorting on numerical strings, this approach only works if retry attempts is <10. I was trying to avoid unnecessary string parsing if not needed. Also I would assume cluster retry attempts >=10 would be rare / too high of a retry limit to be ever set.After finding the correct file, use regex to find the string in the log. Then store it in
test_impl, then pass toissues.Postand I added a new case for parsing out the IP tableAdds Cluster Node IP information to github issue e.g.
n1. My thinking was that if people wanted to reference the vm in logs, it would be nice to keep the names consistent, but if folks want this as n1 / the vm name isn't helpful then happy to change ton1, etc.Also added a new
node-ips.logthat will also contain this tableNotes
Saw "cluster-create" being used as a magic string so created a const for it
clusterCreateDir = "cluster-create"Verification
Added datadriven test
Manual Verification
Verified the renderer in
issueswas formatting the new code block correctly in debug mode, holding off on generating more debug github issues, but can if folks want to see