
PREQ-4481: Diagnose which secret path causes 403 Forbidden failures#71

Merged
tomverin merged 4 commits into master from PREQ-4481/improve-vault-403-diagnostics
Mar 4, 2026


@tomverin tomverin commented Mar 3, 2026

Summary

Resolves PREQ-4481 — when hashicorp/vault-action fails with a 403 Forbidden, the logs show all requested secrets but don't indicate which one caused the failure. This forces users to debug by trial-and-error.

This PR adds a diagnostic step that runs only when the Vault secrets step fails. It:

  1. Authenticates to Vault using a fresh OIDC token
  2. Checks per-path capabilities via sys/capabilities-self (no side effects — does not read actual secrets or generate dynamic credentials)
  3. Reports which specific secret path(s) are DENIED vs OK
  4. Points the user to the Vault orders repository to fix permissions
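
Steps 1 and 2 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the helper names `loginWithJwt` and `checkCapabilities` and the injectable `fetchImpl` parameter are hypothetical, and the JWT auth method is assumed to be mounted at `jwt`. The capabilities call uses the batched `paths` form and only asks Vault about permissions, so no secret is read.

```javascript
// Hypothetical helpers; names and the `fetchImpl` parameter are illustrative.

// Step 1: exchange a GitHub OIDC token for a Vault token.
// Assumption: the JWT auth method is mounted at "jwt".
async function loginWithJwt(vaultUrl, role, jwt, fetchImpl = fetch) {
  const res = await fetchImpl(`${vaultUrl}/v1/auth/jwt/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ role, jwt }),
  });
  if (!res.ok) {
    throw new Error(`Vault login failed with HTTP ${res.status}`);
  }
  const body = await res.json();
  return body.auth.client_token;
}

// Step 2: ask Vault which capabilities the token has on each path.
// All paths go into one request; nothing is read or generated.
async function checkCapabilities(vaultUrl, token, paths, fetchImpl = fetch) {
  const res = await fetchImpl(`${vaultUrl}/v1/sys/capabilities-self`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-Vault-Token': token },
    body: JSON.stringify({ paths }),
  });
  if (!res.ok) {
    throw new Error(`Capability check failed with HTTP ${res.status}`);
  }
  const body = await res.json();
  // Per-path capability lists are read from the `data` object.
  return body.data ?? {};
}
```

Passing `fetchImpl` explicitly keeps the helpers testable without a live Vault; inside the action the default global `fetch` would be used.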

Example output

=== Diagnosing Vault secret access failure ===
Role: github-SonarSource-my-repo

  OK      development/kv/data/repox (read)
  DENIED  development/kv/data/slack
  DENIED  development/artifactory/token/SonarSource-my-repo-promoter

To fix, update the Vault policy for this repository:
https://github.com/SonarSource/re-terraform-aws-vault/tree/master/orders
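
The per-path lines in the example output above could be produced by a formatter along these lines. This is a sketch only: `isDenied` and `formatReport` are hypothetical names, and treating an empty list or a list containing `deny` as "denied" is an assumption about how the step classifies paths.

```javascript
// Hypothetical formatter; the "denied" rule is an assumption, not the PR's code.
function isDenied(capabilities) {
  return capabilities.length === 0 || capabilities.includes('deny');
}

// Builds the report header plus one OK/DENIED line per secret path.
function formatReport(role, capabilitiesByPath) {
  const lines = [
    '=== Diagnosing Vault secret access failure ===',
    `Role: ${role}`,
    '',
  ];
  for (const [path, caps] of Object.entries(capabilitiesByPath)) {
    lines.push(
      isDenied(caps)
        ? `  ${'DENIED'.padEnd(8)}${path}`
        : `  ${'OK'.padEnd(8)}${path} (${caps.join(', ')})`
    );
  }
  return lines.join('\n');
}
```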

Design decisions

  • Zero overhead on success: The diagnostic step only runs when the Vault secrets step fails (continue-on-error: true + if: steps.secrets.outcome == 'failure')
  • sys/capabilities-self instead of direct reads: Avoids generating dynamic secrets (Artifactory tokens, GitHub tokens, etc.) as a side effect. Only checks permissions.
  • Graceful degradation: If the diagnostic authentication itself fails (e.g., role doesn't exist), the step reports that clearly and still fails the action
  • No new dependencies: Reuses actions/github-script (already used in the replace step) and Node.js built-in fetch
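
The graceful-degradation decision above can be illustrated with a small orchestration sketch. The function `runDiagnostics` and its injected `authenticate` and `findDeniedPaths` callbacks are hypothetical stand-ins for the Vault calls, and the wording of the two fallback messages is invented for illustration; only the "N path(s) denied" summary mirrors the output shown in the test plan.

```javascript
// Hypothetical orchestration of the diagnostic step. `core` is the
// @actions/core object that actions/github-script passes to scripts.
async function runDiagnostics({ core, authenticate, findDeniedPaths }) {
  let token;
  try {
    token = await authenticate();
  } catch (err) {
    // Graceful degradation: report the authentication problem and still fail.
    // (Message wording is illustrative.)
    core.setFailed(`Vault diagnostics could not authenticate: ${err.message}`);
    return;
  }

  const denied = await findDeniedPaths(token);
  const summary = denied.length > 0
    // \u2014 is the dash used in the summary line shown in the test plan.
    ? `Vault secrets retrieval failed \u2014 ${denied.length} path(s) denied: ${denied.join(', ')}`
    : 'Vault secrets retrieval failed'; // illustrative wording
  // The secrets step already failed, so the action must fail either way.
  core.setFailed(summary);
}
```

Because the secrets step runs with `continue-on-error: true`, calling `core.setFailed()` on every branch is what keeps the overall action red.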

Related

  • PREQ-4482 — sibling ticket, concrete 403 case from the same reporter that motivated this improvement
  • Past tickets with the same debugging pain: PREQ-3261, PREQ-385, PREQ-1145, PREQ-1208, PREQ-248

Test plan

What we tested

  • The diagnostic step correctly identifies which secret path(s) cause the 403
  • Allowed secrets show OK with their capability (e.g. read)
  • Denied secrets show DENIED
  • The fix message and link to re-terraform-aws-vault are displayed
  • The action still fails with a clear summary (e.g. Vault secrets retrieval failed — 1 path(s) denied: operations/team/re/kv/data/digicert)

How we tested

  1. Test repo: sonar-dummy — a repo with known Vault permissions defined in re-terraform-aws-vault/orders/platform-eng-xp-squad.yaml
  2. Allowed secret: development/kv/data/slack — sonar-dummy has access
  3. Denied secret: operations/team/re/kv/data/digicert — RE team only, sonar-dummy has no access
  4. Test workflow: .github/workflows/test-vault-403-diagnostics.yml — requests both secrets to trigger a 403, then runs the diagnostic step

Test workflow run

  • sonar-dummy PR #563 — draft PR with the test workflow

  • Workflow run 22661709735 — validates batched sys/capabilities-self API and OIDC fix:

    === Diagnosing Vault secret access failure ===
    Role: github-SonarSource-sonar-dummy
    
      OK      development/kv/data/slack (read)
      DENIED  operations/team/re/kv/data/digicert
    
    To fix, update the Vault policy for this repository:
    https://github.com/SonarSource/re-terraform-aws-vault/tree/master/orders
    
    Vault secrets retrieval failed — 1 path(s) denied: operations/team/re/kv/data/digicert
    

Checklist

  • Verify the diagnostic output correctly identifies denied vs allowed paths
  • Test with a repo that has a known missing secret (sonar-dummy + digicert)
  • Verify the action still fails properly (the diagnostic step calls core.setFailed())
  • Verify the existing test workflow passes (happy path — valid secrets still work with continue-on-error: true)
  • Verify the outputs.vault is still available on success

When vault-action fails with a 403 Forbidden, automatically diagnose
which specific secret path(s) are denied by checking per-path
capabilities via Vault's sys/capabilities-self endpoint.

This avoids generating dynamic secrets as a side effect and only
runs on failure, so there is zero overhead on successful runs.
@tomverin tomverin marked this pull request as ready for review March 3, 2026 14:21
@tomverin tomverin requested a review from a team as a code owner March 3, 2026 14:21
Copilot AI review requested due to automatic review settings March 3, 2026 14:21

Copilot AI left a comment


Pull request overview

This PR improves debuggability of Vault secret retrieval failures (PREQ-4481) in this composite action by adding a conditional diagnostic step to identify which specific secret paths are denied when hashicorp/vault-action fails with a 403.

Changes:

  • Marks the Vault secrets step as continue-on-error and adds a follow-up diagnostic step that authenticates via OIDC and checks per-path capabilities.
  • Emits OK/DENIED status per secret path and fails the action with a clearer error summary when secrets retrieval fails.
  • Documents the new 403 (Forbidden) diagnostics behavior in the README.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| action.yaml | Adds conditional diagnostics on Vault secrets failure using sys/capabilities-self to report denied paths. |
| README.md | Adds FAQ documentation for 403 Forbidden errors and shows example diagnostic output. |


tomverin added 3 commits March 4, 2026 09:30
The diagnostic step was failing with "invalid audience (aud) claim" because
we passed vaultUrl as the audience. hashicorp/vault-action uses getIDToken()
without an audience when jwtGithubAudience is not set. Use the same to match
Vault's expected bound_audiences.
- Vault API expects paths (array) not path (string) for sys/capabilities-self
- Batch all secret paths in a single request to reduce Vault load
- Parse response: data[path] for each path, fallback to data.capabilities for single-path
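
The audience fix described in the first of these commit messages might look like the sketch below. The wrapper name `requestOidcToken` is hypothetical; `core.getIDToken()` is the `@actions/core` call, which accepts an optional audience. Omitting the argument when `jwtGithubAudience` is unset follows what the commit message says about hashicorp/vault-action.

```javascript
// Hypothetical wrapper; `core` is the @actions/core object that
// actions/github-script makes available to the script.
async function requestOidcToken(core, jwtGithubAudience) {
  // Only pass an audience when one was explicitly configured, so the
  // token's `aud` claim matches what Vault's bound_audiences expects.
  return jwtGithubAudience
    ? core.getIDToken(jwtGithubAudience)
    : core.getIDToken();
}
```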

@tomverin tomverin left a comment


Review feedback addressed

  1. paths vs path — Applied. The Vault API expects paths (array) for /sys/capabilities-self. Updated to body: JSON.stringify({ paths: secretPaths }) and adjusted response parsing to use data[secretPath] ?? data.capabilities ?? [].

  2. Batching — Applied. Replaced the per-path loop with a single batched request. All secretPaths are sent in one call, reducing Vault load and speeding up diagnostics when many secrets are requested.

  3. Integration test — Addressed via sonar-dummy. The failure path is covered by sonar-dummy PR #563, which runs a workflow that intentionally requests an unauthorized secret (operations/team/re/kv/data/digicert) and asserts the diagnostic step emits OK/DENIED output. See the Test plan section in the PR description for links to the workflow run.
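
The response parsing mentioned in item 1 can be isolated in a small helper. The function name `capabilitiesFor` is hypothetical; its body is the expression quoted above, which prefers the per-path entry, falls back to the single-path `capabilities` field, and finally to an empty list.

```javascript
// Hypothetical helper wrapping the expression quoted in item 1.
function capabilitiesFor(data, secretPath) {
  // Batched responses key capability lists by path; the single-path
  // form reports them under `capabilities`.
  return data[secretPath] ?? data.capabilities ?? [];
}
```

Usage: `capabilitiesFor(data, 'development/kv/data/slack')` yields the list for that path when the batched response contains it.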

@tomverin tomverin merged commit 3d5c87c into master Mar 4, 2026
2 checks passed
@tomverin tomverin deleted the PREQ-4481/improve-vault-403-diagnostics branch March 4, 2026 09:25