Adding support for test failure detection in valkey#24
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR introduces a complete Test Failure Detector workflow for Valkey CI that monitors the Daily CI pipeline, downloads test-failure artifacts, deduplicates failures by test identity, and manages labeled GitHub issues with environment tracking and recurrence comments across runs. ChangesTest Failure Detector
Sequence Diagram(s)sequenceDiagram
participant Workflow as GitHub Actions
participant Main as main.py
participant Download as download.py
participant Parser as parse_failures.py
participant Render as issue_renderer.py
participant IssueDedup as IssueDedupPublisher
Workflow->>Main: run(token, repo, run_id, dry_run)
Main->>Download: get_latest_daily_run()
Download-->>Main: WorkflowRun
Main->>Download: download_all_test_failures()
Download-->>Main: all_test_failures bytes
Main->>Download: get_job_urls()
Download-->>Main: job_urls map
Main->>Parser: parse_and_deduplicate()
Parser-->>Main: list[UniqueFailure]
alt dry_run mode
Main-->>Workflow: job_summary (failures, 0 issues)
else normal mode
Main->>Render: render_for(), merge_environments()
loop per UniqueFailure
Main->>IssueDedup: upsert(fingerprint, render, body_transform)
IssueDedup-->>Main: created|updated|skipped
end
Main-->>Workflow: job_summary (failures, created, updated)
end
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (6)
.github/workflows/test-failure-detector-sweep.yml (5)
39-39: 💤 Low valueConsider pinning checkout and setup-python actions to commit SHAs.
Static analysis flags the use of tag-based action references (
@v6) instead of commit SHAs. While you've already pinnedcreate-github-app-tokento a SHA (line 55), thecheckoutandsetup-pythonactions remain unpinned.Pinning to commit SHAs provides supply-chain attack protection but requires more maintenance. This is a repository-wide policy decision.
Also applies to: 45-45
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml at line 39, Replace tag-based GitHub Action references with pinned commit SHAs for supply-chain safety: update the uses entries that reference "actions/checkout@v6" and "actions/setup-python@v4" to their corresponding full commit SHA values (matching the versions you intend to use), ensuring the workflow still functions and updating any dependent inputs if necessary; verify the pinned SHAs are from upstream actions/checkout and actions/setup-python repositories and run the workflow to confirm no regressions.
24-36: ⚡ Quick winClarify permissions intent or remove unused grant.
The workflow-level
issues: write(line 26) is overridden by the job-level permissions block (lines 35-36), which grants onlycontents: read. Since the detector uses the GitHub App token (TARGET_TOKEN) for issue operations rather than the defaultGITHUB_TOKEN, the workflow-levelissues: writegrant is unused.Either remove
issues: writefrom lines 24-26, or add a comment explaining that the App token (notGITHUB_TOKEN) handles issue operations.📝 Option 1: Remove unused permission
permissions: contents: read - issues: write📝 Option 2: Document the intent
permissions: contents: read - issues: write + issues: write # Unused by workflow; detector uses GitHub App token for cross-repo issue ops🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml around lines 24 - 36, The workflow defines a workflow-level permissions block granting "issues: write" but the job "detect-failures" overrides permissions to only "contents: read" and the detector actually uses the GitHub App token (TARGET_TOKEN) for issue operations, so remove the unused workflow-level "issues: write" entry or add a clear comment above the workflow-level permissions explaining that issue writes are handled by the App token (TARGET_TOKEN) and not by GITHUB_TOKEN; update the "permissions" block or add the comment near the top-level permissions and keep the job-level permissions in the "detect-failures" job as-is to avoid confusion.
66-66: 💤 Low valueConsider simplifying the DRY_RUN expression for clarity.
The current expression
${{ github.event_name == 'workflow_dispatch' && inputs.dry_run == true && 'true' || 'false' }}is correct but hard to parse. The logic ensures dry-run mode is only honored for manual dispatch (scheduled runs always create issues).Consider splitting the logic or adding a comment to document the intent.
♻️ Alternative: add explanatory comment
+ # dry_run is only honored for manual dispatch; scheduled runs always create issues DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run == true && 'true' || 'false' }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml at line 66, The DRY_RUN expression is hard to read; replace the current expression with a ternary for clarity and add a short comment: set DRY_RUN to ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run == 'true' ? 'true' : 'false' }} (and add a one-line comment above explaining "only honor inputs.dry_run for manual workflow_dispatch runs; scheduled runs always create issues"). Reference the DRY_RUN env variable and the existing github.event_name and inputs.dry_run symbols when making the change.
13-16: ⚡ Quick winUse
type: numberfor the run_id input.The
run_idinput represents a workflow run ID, which is numeric. GitHub Actions supportstype: numberfor workflow_dispatch inputs, which better documents the expected value type thantype: string.♻️ Proposed fix
run_id: description: "Specific workflow run ID to analyze (empty = latest completed Daily run)" required: false - type: string - default: "" + type: numberNote: With
type: number, omitting the input will passnullinstead of"", so remove the empty string check on line 76.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml around lines 13 - 16, Change the workflow_dispatch input "run_id" to use type: number instead of type: string by updating the run_id input definition (replace type: string with type: number); then remove or adjust the code that checks for an empty string value for run_id (the check that compares run_id == ""), since with type: number an omitted input will be null rather than ""; ensure any subsequent logic expects a number or null and handle null appropriately.
29-34: 💤 Low valueConsider adding a job timeout for defense in depth.
The job currently has no
timeout-minutesset, relying on the GitHub Actions default (360 minutes). While this workflow is expected to complete in minutes (artifact download + parsing + API calls), adding a modest timeout (e.g., 30-60 minutes) would catch hangs or API degradation scenarios.🛡️ Proposed addition
detect-failures: name: "Detect test failures" + timeout-minutes: 60 if: >🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml around lines 29 - 34, Add a defensive job timeout to the detect-failures job by setting the timeout-minutes property (e.g., 30 or 60) under the detect-failures job definition so the workflow won't hang indefinitely; update the job block that defines name: "Detect test failures" / runs-on: ubuntu-latest to include timeout-minutes: 30 (or 60) directly beneath runs-on.tests/test_issue_manager.py (1)
109-115: ⚡ Quick winAdd a regression case for bodies without an
Environmentsline.Current coverage only proves replacement works when the marker is already present. Please add a case where an existing issue body has no
**Environments:**section, then verifyprocess_failures()appends the new environments instead of silently leaving the body unchanged.Also applies to: 121-220
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_issue_manager.py` around lines 109 - 115, Add a regression test for when the issue body lacks an existing "**Environments:**" line: create a new test (e.g., in class TestUpdateEnvironments) that calls _update_environments_in_body with a body like "Some text\nMore text" and environments ["job-a", "job-b"] and assert the returned body now contains a new "**Environments:** `job-a`, `job-b`" line while preserving the original text; also add a parallel case calling process_failures() (or the higher-level wrapper used in tests) to ensure it appends the environments instead of leaving the body unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/test_failure_detector/download.py`:
- Around line 135-143: The ZIP extraction currently picks the first .json entry
and lets ZIP errors propagate; change it to explicitly look for
"all-test-failures.json" (instead of using json_name = next(...)) and only read
that member via zf.read("all-test-failures.json"); catch and handle invalid
archives and missing members (zipfile.BadZipFile, KeyError, IOError) by logging
a warning like "Artifact missing or corrupt: all-test-failures.json" and
returning None so malformed/corrupt cross-repo inputs are treated as misses;
update the block around zipfile.ZipFile, names, json_name and zf.read
accordingly.
In `@scripts/test_failure_detector/main.py`:
- Around line 72-77: When get_latest_daily_run(gh, repo_full_name,
workflow_name, branch) returns None the function currently logs and returns 1
without emitting the CI summary; before returning add a call to
emit_job_summary(...) to record the failure path (include relevant context like
workflow_name, repo_full_name, branch and the error message), e.g. invoke
emit_job_summary(...) right after logger.error("No qualifying workflow run
found.") so the summary is always produced on this early exit; ensure the
function signature and any required parameters for emit_job_summary are provided
from the surrounding scope.
In `@scripts/test_failure_detector/manage_issues.py`:
- Around line 25-42: The ensure_label_exists() bootstrap races when two
processes both see the label missing and one create call fails; wrap the
repo.create_label call (the retry_github_call invocation that uses
repo.create_label and _LABEL_NAME) in a try/except that catches GithubException
from the create and ignores it when it indicates the label already exists (e.g.,
status codes 409/422 or an "already exists" message), logging that another
process created the label, and re-raise the exception for any other status; keep
using retry_github_call for the actual create and only handle the
conflict/duplicate-create error around that call.
- Around line 105-108: The current _update_environments_in_body(body, all_envs)
replaces an existing "**Environments:**" line but does nothing if that marker is
missing; change it to either replace the marker when present or append a new
"**Environments:** ..." line when absent so new environments are persisted for
migrated/manual issues. Update the function _update_environments_in_body to
check for the marker (using the same regex or search) and if found use re.sub as
now, otherwise append the constructed new_env_line (with proper spacing/newline)
in a sensible location (e.g., end of body or before a known section header) so
callers that invoke it (the code that updates matched issues) will always
persist the updated all_envs list.
In `@scripts/test_failure_detector/parse_failures.py`:
- Around line 46-48: Check that the top-level artifact all_failures is a dict
before iterating (the loop using "for job_name, suites in
all_failures.items()"); if it is not a dict (e.g., list/string/null), log an
informative message and return [] to avoid crashing the sweep; update the code
around where grouped: dict[str, UniqueFailure] = {} and the for loop to perform
an isinstance(all_failures, dict) guard, emit a clear error/log including the
unexpected type/value, and return [] for invalid payloads.
---
Nitpick comments:
In @.github/workflows/test-failure-detector-sweep.yml:
- Line 39: Replace tag-based GitHub Action references with pinned commit SHAs
for supply-chain safety: update the uses entries that reference
"actions/checkout@v6" and "actions/setup-python@v4" to their corresponding full
commit SHA values (matching the versions you intend to use), ensuring the
workflow still functions and updating any dependent inputs if necessary; verify
the pinned SHAs are from upstream actions/checkout and actions/setup-python
repositories and run the workflow to confirm no regressions.
- Around line 24-36: The workflow defines a workflow-level permissions block
granting "issues: write" but the job "detect-failures" overrides permissions to
only "contents: read" and the detector actually uses the GitHub App token
(TARGET_TOKEN) for issue operations, so remove the unused workflow-level
"issues: write" entry or add a clear comment above the workflow-level
permissions explaining that issue writes are handled by the App token
(TARGET_TOKEN) and not by GITHUB_TOKEN; update the "permissions" block or add
the comment near the top-level permissions and keep the job-level permissions in
the "detect-failures" job as-is to avoid confusion.
- Line 66: The DRY_RUN expression is hard to read; replace the current
expression with a ternary for clarity and add a short comment: set DRY_RUN to
${{ github.event_name == 'workflow_dispatch' && inputs.dry_run == 'true' ?
'true' : 'false' }} (and add a one-line comment above explaining "only honor
inputs.dry_run for manual workflow_dispatch runs; scheduled runs always create
issues"). Reference the DRY_RUN env variable and the existing github.event_name
and inputs.dry_run symbols when making the change.
- Around line 13-16: Change the workflow_dispatch input "run_id" to use type:
number instead of type: string by updating the run_id input definition (replace
type: string with type: number); then remove or adjust the code that checks for
an empty string value for run_id (the check that compares run_id == ""), since
with type: number an omitted input will be null rather than ""; ensure any
subsequent logic expects a number or null and handle null appropriately.
- Around line 29-34: Add a defensive job timeout to the detect-failures job by
setting the timeout-minutes property (e.g., 30 or 60) under the detect-failures
job definition so the workflow won't hang indefinitely; update the job block
that defines name: "Detect test failures" / runs-on: ubuntu-latest to include
timeout-minutes: 30 (or 60) directly beneath runs-on.
In `@tests/test_issue_manager.py`:
- Around line 109-115: Add a regression test for when the issue body lacks an
existing "**Environments:**" line: create a new test (e.g., in class
TestUpdateEnvironments) that calls _update_environments_in_body with a body like
"Some text\nMore text" and environments ["job-a", "job-b"] and assert the
returned body now contains a new "**Environments:** `job-a`, `job-b`" line while
preserving the original text; also add a parallel case calling
process_failures() (or the higher-level wrapper used in tests) to ensure it
appends the environments instead of leaving the body unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: d2f84d0b-8c49-4872-8f1a-2308f045bf68
📒 Files selected for processing (12)
.github/workflows/test-failure-detector-sweep.ymlREADME.mddocs/architecture.mdscripts/test_failure_detector/__init__.pyscripts/test_failure_detector/download.pyscripts/test_failure_detector/main.pyscripts/test_failure_detector/manage_issues.pyscripts/test_failure_detector/parse_failures.pytests/test_artifact_fetcher.pytests/test_download.pytests/test_failure_parser.pytests/test_issue_manager.py
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
…al trigger), running with BChan-0/valkey daily ci Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
valkey-io -> BChan-0/valkey Signed-off-by: Bonnie Chan <bonchan35@gmail.com> Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
change line 30 to test on forked repo valkey-io/valkey -> BChan-0/valkey Signed-off-by: Bonnie Chan <bonchan35@gmail.com> Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/test-failure-detector-sweep.yml:
- Around line 39-45: Replace mutable action tags with their corresponding commit
SHAs and restrict the GitHub App token scope: change usages of
actions/checkout@v6 and actions/setup-python@v6 to the exact commit SHAs (pin to
the specific SHA for those actions), and update the
actions/create-github-app-token step to include a repositories: entry scoped to
the target repo (e.g., use the inputs.repo or 'valkey-io/valkey') so the token
is not granted access to all repos under owner: valkey-io.
- Around line 56-60: The create-github-app-token step currently provides owner:
valkey-io which yields a token scoped to all installations; update the step that
calls actions/create-github-app-token to include repositories: ${{ inputs.repo
|| 'valkey-io/valkey' }} so the generated installation token is limited to the
targeted repo (matching the later --repo "${{ inputs.repo || 'valkey-io/valkey'
}}" usage) and enforces least privilege while preserving workflow_dispatch
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 168eb16a-d63e-43eb-b209-b25863700f12
📒 Files selected for processing (12)
.github/workflows/test-failure-detector-sweep.ymlREADME.mddocs/architecture.mdscripts/test_failure_detector/__init__.pyscripts/test_failure_detector/download.pyscripts/test_failure_detector/main.pyscripts/test_failure_detector/manage_issues.pyscripts/test_failure_detector/parse_failures.pytests/test_artifact_fetcher.pytests/test_download.pytests/test_failure_parser.pytests/test_issue_manager.py
✅ Files skipped from review due to trivial changes (3)
- docs/architecture.md
- scripts/test_failure_detector/init.py
- README.md
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/test_download.py
- scripts/test_failure_detector/parse_failures.py
- scripts/test_failure_detector/main.py
- tests/test_artifact_fetcher.py
- tests/test_issue_manager.py
- scripts/test_failure_detector/download.py
- scripts/test_failure_detector/manage_issues.py
- tests/test_failure_parser.py
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
|
I will take a look! |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/test-failure-detector-sweep.yml (1)
28-37: ⚡ Quick winConsider adding a timeout to prevent runaway jobs.
The job inherits the default GitHub Actions timeout of 360 minutes, but the GitHub App token expires after 60 minutes. Adding an explicit
timeout-minutes: 60(or 30) would align the job lifecycle with the token lifetime and prevent unexpected long-running scenarios.⏱️ Suggested addition
jobs: detect-failures: name: "Detect test failures" + timeout-minutes: 60 if: > (github.event_name == 'workflow_dispatch' || github.repository == 'valkey-io/valkey-ci-agent')Based on learnings: In the
valkey-io/valkey-ci-agentrepository, the fuzzer monitor workflow intentionally setstimeout-minutes: 180as a worst-case safety net, with real runs expected to complete in 10–20 minutes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/test-failure-detector-sweep.yml around lines 28 - 37, The detect-failures job currently has no explicit timeout and can run up to the GitHub default (360 minutes); add a timeout-minutes field to the job definition (the jobs.detect-failures block) to align with the GitHub App token lifetime — e.g., add timeout-minutes: 60 (or 30) directly under the detect-failures job metadata (near name/runs-on/permissions) so the job will be forcibly cancelled if it exceeds that limit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/test_failure_detector/download.py`:
- Around line 191-195: The normalized alias can overwrite an existing exact
job-name mapping (job_url_map) causing wrong URLs; ensure you first store the
exact mapping (job_url_map[job.name] = job.html_url) and only add the normalized
alias (normalized) if it differs from job.name and does not already exist as a
key in job_url_map (i.e., check normalized not in job_url_map before assigning
job_url_map[normalized] = job.html_url) so exact names always take precedence.
---
Nitpick comments:
In @.github/workflows/test-failure-detector-sweep.yml:
- Around line 28-37: The detect-failures job currently has no explicit timeout
and can run up to the GitHub default (360 minutes); add a timeout-minutes field
to the job definition (the jobs.detect-failures block) to align with the GitHub
App token lifetime — e.g., add timeout-minutes: 60 (or 30) directly under the
detect-failures job metadata (near name/runs-on/permissions) so the job will be
forcibly cancelled if it exceeds that limit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 6969c79d-14c2-45cd-8195-bf8a8dfbcc6a
📒 Files selected for processing (12)
.github/workflows/test-failure-detector-sweep.ymlREADME.mddocs/architecture.mdscripts/test_failure_detector/__init__.pyscripts/test_failure_detector/download.pyscripts/test_failure_detector/main.pyscripts/test_failure_detector/manage_issues.pyscripts/test_failure_detector/parse_failures.pytests/test_artifact_fetcher.pytests/test_download.pytests/test_failure_parser.pytests/test_issue_manager.py
✅ Files skipped from review due to trivial changes (2)
- scripts/test_failure_detector/init.py
- docs/architecture.md
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/test_failure_parser.py
- tests/test_download.py
- README.md
- scripts/test_failure_detector/manage_issues.py
- tests/test_issue_manager.py
- scripts/test_failure_detector/parse_failures.py
- scripts/test_failure_detector/main.py
- tests/test_artifact_fetcher.py
@hanxizh9910 Here's a test issue chain on my fork: BChan-0/valkey#7 |
@BChan-0 Thanks for this! I noticed the issue was created today (~5:43 PM Pacific), but the CI links point to runs from two days ago, is it expected? Also, could you verify the update part works as well? Specifically: when a test fails in some new jobs on the next run, does it add a new comment and update the Environments section? |
@hanxizh9910 Yes, the linked CI runs are correct. I have Daily intentionally skipping and only running environments manually for triggering dummy-flaky runs, so the latest valid test was at that time. Here's a new issue chain where you can see the environment update working more clearly. The logic is similar to 3358, just housed differently. |
You can put this example in the |
asagege
left a comment
There was a problem hiding this comment.
Some comments for README
Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
Signed-off-by: Bonnie Chan <bonchan35@gmail.com>
…ing to only valkey repo Signed-off-by: Bonnie Chan <bonniecv@amazon.com>
There was a problem hiding this comment.
Disclaimer AI Review:
Findings
-
P1 duplicate issue regression: The new publisher dedups only by hidden marker (
scripts/common/issue_dedup.py:87,scripts/common/issue_dedup.py:151), but existing opentest-failureissues from the old Valkey workflow do not havevalkey-ci-agent:test-failuremarkers. I verifiedvalkey-io/valkey#3992and#3979lack the marker, and there are 21 opentest-failureissues today. First recurrence after migration will create duplicate issues instead of updating them. Add a migration/backcompat path: lookup by existing[TEST-FAILURE] ...title + label, then inject the marker on update. -
P1 not mergeable with latest
main: PR headb1f32472conflicts with latest upstreammaindb76ce22.gh pr viewreportsmergeable: CONFLICTING;git merge-treeshows conflicts inREADME.mdanddocs/architecture.md. -
P2 unsafe raw fingerprints:
scripts/test_failure_detector/issue_renderer.py:28embeds raw test names and paths into HTML comments and GitHub search queries. Long names, quotes, newlines, or comment-breaking text can break marker search/dedup. Existing shared fuzzer code uses hashed fingerprints viascripts/common/incidents.py; this detector should do the same, with the title fallback above for migration. -
P3 diff check failure:
.github/workflows/test-failure-detector-sweep.yml:66has trailing whitespace.git diff --check upstream/main...HEADfails.
Resolve README.md and docs/architecture.md by keeping both feature sets: the Test Failure Detector (this branch) and upstream's CI Fix bot, plus upstream's backport poll / mark-done / revert workflows. - README: both workflow-table rows and both feature sections; removed a duplicate Test Failure Detector section; adopted upstream's hyphen house-style in the new prose. - architecture.md: both flow sections, both layer entries and entry-point lists; Planned Workflows keeps PR Reviewer, Autonomous CI-fix poller, and Additional Daily CI Analysis. Signed-off-by: Bonnie Chan <213245020+BChan-0@users.noreply.github.com>
…ration fallback Signed-off-by: Bonnie Chan <213245020+BChan-0@users.noreply.github.com>
Signed-off-by: Bonnie Chan <213245020+BChan-0@users.noreply.github.com>
57afb88 to
34d036e
Compare
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Looks good to me mostly. Did we test if the issue dedup works with existing issues created by the existing framework?
@sarthakaggarwal97 Thanks for the review. I had implemented tests for that case based on your earlier AI Review, but I just double-checked to confirm for your comment and found that the code for it was actually stashed from my main branch when I was resolving another issue. I've committed everything back into the branch now. The dedup uses |
…printing with test suite Signed-off-by: Bonnie Chan <213245020+BChan-0@users.noreply.github.com>
#### Purpose This workflow was originally introduced in PR [#3358](#3358), where we detect the failures in our scheduled `daily` runs and create / update github issues. We want to do more things with AI with respect to tests failures. It could include potentially finding the root cause, any PR that broke the tests, some helpful dashboard to track daily tests, maybe some analysis or possible fix as well. To achieve that, we are moving this issue management out of this repository and into `valkey-ci-agent`. The Daily workflow in this repository still records per-job test failures, consolidates them into `all-test-failures.json`, and uploads the `all-test-failures` artifact. The workflow being removed here was only responsible for consuming that artifact and creating or updating GitHub issues. #### Changes Remove `.github/workflows/test-failure-detector.yml`. Issue creation and updates are now handled by the Test Failure Detector workflow in `valkey-ci-agent` through this PR [#24](valkey-io/valkey-ci-agent#24). #### Notes This should be merged together with the corresponding `valkey-ci-agent` change so scheduled test-failure detection continues without a gap. Signed-off-by: Bonnie Chan <bonchan35@gmail.com>
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Merged valkey-io/valkey#3919. Proceeding with this merge as well. Thanks @BChan-0 for the effort.
Test Failure Detector (Original: PR 3358)
Monitors the Daily CI workflow on
valkey-io/valkey, detects test failures, and automatically creates or updates GitHub issues to track them.Primary Changes from PR 3358
How it works
unstablebranch, or uses a manually input run IDall-test-failuresartifact from the CI workflow. Uses an HTTP handler to strip the Authorization header on the redirect to Azure blob storage{job → suite → [failures]}) and groups by{test_name, test_file}such that a test failing across multiple jobs becomes one unique failure with multiple job references[TEST-FAILURE] {test_name} in {test_file}) already exists: updates the environments list and adds a recurrence comment with the datetest-failurelabel, error stack trace, CI links, and environment listA GitHub Actions job summary is emitted at every exit path with a table of metrics (failures detected, issues created/updated).
Prerequisites: Cross-repo Authentication
The workflow generates a GitHub App installation token scoped to the
valkey-ioorg using the same App secrets as the backport workflow (VALKEYRIE_BOT_APP_ID+VALKEYRIE_BOT_PRIVATE_KEY). This token providesactions:read(to download artifacts) andissues:write(to create/update issues) onvalkey-io/valkey.Usage
Scheduled (automatic)
Runs daily at 23:00 UTC via cron. The workflow runs on
valkey-io/valkey-ci-agentand uses a GitHub App token to read artifacts from and write issues tovalkey-io/valkey. Valkey Daily CI runs daily at 00:00 UTC, with runs typically completing within 4-7 hours, with slight exception (from valkey-io/valkey's history of completed runs, <10 runs exceed 7 hours, with the longest lasting 10h 02m). As valkey's test suite grows, the run time for daily will increase, so attempting to capture runs at an "earliest" time would require frequent maintenance. In the other direction, the The Failure Detector's runtime will remain nearly constant (from valkey-io and forked history of completed runs, has never exceeded 30s of runtime and runner availability is less severe on valkey-ci-agent), so it is safer for cron to be closer to the start of the Daily CI workflow as opposed to the end. As such, the Test Failure Detector should always capture the current day's workflow. In any case of a skip, manual dispatch is available. Observation of runner availability will continue post-merge for confirmation of this arrangement in practice.Manual dispatch
repo— target repository to scan (default:valkey-io/valkey)run_id— specific workflow run ID to analyze (empty = latest Daily run)dry_run— parse and report only, don't create/update issuesRelevant Tests (all passing) (examples run from terminal instead of dispatch)
tests/test_testfailuredetector_download.py
tests/test_testfailuredetector_failure_parser.py
tests/test_testfailuredetector_issue_manager.py
tests/test_issue_dedup.py
See commenting with environment update function example: Issue Link