
flaky build failure: "remote wall time is too far ahead" #962

@davepacheco

Here's where we saw it:
https://github.com/oxidecomputer/omicron/runs/6133927246?check_suite_focus=true#step:11:1956

error: failed to run custom build command for `nexus-test-utils v0.1.0 (/Users/runner/work/omicron/omicron/nexus/test-utils)`

Caused by:
  process didn't exit successfully: `/Users/runner/work/omicron/omicron/target/debug/build/nexus-test-utils-08a85905696fbfd8/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=../../common/src/sql/dbinit.sql
  cargo:rerun-if-changed=../../tools/cockroachdb_checksums
  cargo:rerun-if-changed=../../tools/cockroachdb_version

  --- stderr
  Apr 22 19:42:29.902 INFO cockroach temporary directory: /tmp/omicron_tmp/.tmpCY01Eg
  Apr 22 19:42:29.902 INFO cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store /Users/runner/work/omicron/omicron/target/debug/build/nexus-test-utils-4b01e832e8f96d40/out/crdb-base --listen-addr 127.0.0.1:0 --listening-url-file /tmp/omicron_tmp/.tmpCY01Eg/listen-url
  Apr 22 19:42:42.034 INFO cockroach pid: 7411
  Apr 22 19:42:42.034 INFO cockroach listen URL: postgresql://root@127.0.0.1:49430/omicron?sslmode=disable
  Apr 22 19:42:42.034 INFO cockroach: populating
  thread 'main' panicked at 'failed to populate database: populate

  Caused by:
      0: populating Omicron database
      1: db error: ERROR: polling for queued jobs to complete: poll-show-jobs: remote wall time is too far ahead (1.732092s) to be trustworthy
      2: ERROR: polling for queued jobs to complete: poll-show-jobs: remote wall time is too far ahead (1.732092s) to be trustworthy', /Users/runner/work/omicron/omicron/test-utils/src/dev/mod.rs:150:35
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
  WARN: temporary directory leaked: /tmp/omicron_tmp/.tmpCY01Eg
warning: build failed, waiting for other jobs to finish...
error: build failed
Error: Process completed with exit code 101.

This message appears to come from CockroachDB, which has a poll-show-jobs operation and a component that emits this error message. What's odd is that this is a single-node CockroachDB cluster, so there's only one system involved. My first thought was that the system clock jumped during the test, but that seems unlikely. I wonder if this is instead a symptom of CPU starvation on the GitHub Actions workers. That is: maybe CockroachDB takes one timestamp on the client and one on the server, compares them, and reports this error when they're too far apart. Even on a single system, that check could fail if the process were stuck off-CPU for a while between the two calls to get timestamps.
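
To make the hypothesis concrete, here is a minimal sketch of how a stall between two timestamp reads could trip a "too far ahead" check even though both reads come from the same clock. This is not CockroachDB's code: `FakeClock`, `check_remote_time`, `exchange`, and the 0.5 s tolerance are all hypothetical names and values made up for illustration, and the real check may work quite differently.

```python
# Hypothetical tolerance in seconds; an assumption, not a confirmed CockroachDB setting.
MAX_OFFSET_S = 0.5


class ClockOffsetError(Exception):
    """Raised when the remote reading is too far ahead of the local one."""


class FakeClock:
    """A single shared wall clock that only moves when we tell it to."""

    def __init__(self, start=0.0):
        self._now = start

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


def check_remote_time(local_reading, remote_reading, max_offset=MAX_OFFSET_S):
    """Reject a remote timestamp that is more than max_offset ahead of local."""
    ahead = remote_reading - local_reading
    if ahead > max_offset:
        raise ClockOffsetError(
            f"remote wall time is too far ahead ({ahead:.6f}s) to be trustworthy"
        )
    return ahead


def exchange(clock, stall_s):
    """Simulate one exchange on a single machine with one clock.

    The "local" side reads the clock, the process is then stuck off-CPU
    for stall_s seconds, and only afterwards is the "remote" timestamp taken.
    """
    local_reading = clock.now()
    clock.advance(stall_s)  # stand-in for CPU starvation between the two reads
    remote_reading = clock.now()
    return check_remote_time(local_reading, remote_reading)


if __name__ == "__main__":
    print(exchange(FakeClock(), 0.01))  # short stall: check passes
    try:
        exchange(FakeClock(), 1.732092)  # stall as long as the offset in the log
    except ClockOffsetError as err:
        print(err)
```

In this model the clock never jumps and there is only one machine. The stale local reading alone is enough to make the "remote" timestamp look untrustworthy, which is the scenario the starvation theory describes.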


Labels

Test Flake: Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
development: Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
