M-DASH: Fix eval dashboard history preservation and reliability #7

Merged: MarkEdmondson1234 merged 4 commits into dev from fix/eval-dashboard-reliability on Oct 16, 2025

@MarkEdmondson1234

Summary

Fixes the eval dashboard reliability issues documented in design_docs/planned/eval-dashboard-reliability.md.

Problem: Running ailang eval-report --format=json was destroying historical dashboard data because it regenerated history from scratch by scanning disk, losing deleted baselines.

Solution: Implemented read-modify-write pattern with history preservation, validation, and atomic writes.

Changes

M1: History Preservation (4h)

  • ✅ Added DashboardJSON and HistoryEntry types with validation
  • ✅ Implemented loadExistingDashboard() to read current JSON
  • ✅ Implemented mergeHistory() with duplicate detection
  • ✅ Updated ExportBenchmarkJSON() to preserve history
  • ✅ 7 comprehensive tests (100% coverage of the new code)
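
The merge rule described above (update the entry if the version already exists, append it otherwise) can be sketched as follows. This is a minimal illustration, not the code from export_docusaurus.go: the `HistoryEntry` fields, the `mergeHistory` signature, and every sample value except the 58.7% success rate are assumptions.

```go
package main

import "fmt"

// HistoryEntry is a hypothetical, trimmed-down stand-in for the PR's type.
type HistoryEntry struct {
	Version     string  `json:"version"`
	Timestamp   string  `json:"timestamp"`
	SuccessRate float64 `json:"success_rate"`
}

// mergeHistory returns a copy of existing with entry merged in.
// An entry with the same Version is replaced in place (no duplicates);
// an unseen Version is appended.
func mergeHistory(existing []HistoryEntry, entry HistoryEntry) []HistoryEntry {
	merged := make([]HistoryEntry, len(existing), len(existing)+1)
	copy(merged, existing)
	for i := range merged {
		if merged[i].Version == entry.Version {
			merged[i] = entry
			return merged
		}
	}
	return append(merged, entry)
}

func main() {
	// Illustrative values only.
	history := []HistoryEntry{
		{Version: "v0.3.8", Timestamp: "2025-10-01T00:00:00Z", SuccessRate: 0.50},
		{Version: "v0.3.9", Timestamp: "2025-10-10T00:00:00Z", SuccessRate: 0.55},
	}
	// Re-running the same version updates its entry.
	history = mergeHistory(history, HistoryEntry{Version: "v0.3.9", Timestamp: "2025-10-16T00:00:00Z", SuccessRate: 0.587})
	// A new version is appended.
	history = mergeHistory(history, HistoryEntry{Version: "v0.3.10", Timestamp: "2025-10-17T00:00:00Z", SuccessRate: 0.60})
	for _, h := range history {
		fmt.Println(h.Version, h.SuccessRate)
	}
}
```

Returning a fresh slice rather than mutating the input keeps the loaded dashboard intact if a later step fails.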

M2: Validation & Atomic Writes (2h)

  • ✅ Added DashboardJSON.Validate() method
  • ✅ Implemented writeJSONAtomic() with temp file pattern
  • ✅ Validation catches: missing version, missing timestamp, duplicate versions
  • ✅ Atomic rename ensures all-or-nothing writes

M3: Baseline Metadata Fixes (1h)

  • ✅ Updated tools/eval_baseline.sh to require explicit VERSION
  • ✅ Separated version from git_describe in baseline.json
  • ✅ Removed cached success_count (calculated dynamically)
  • ✅ Updated Makefile to require EVAL_VERSION parameter
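
To illustrate why dropping the cached success_count removes a class of stale-data bugs, here is a small sketch that derives the statistics from the result records every time. `RunResult`, `Summary`, and `summarize` are hypothetical names; the real loader's types are not shown in this PR. The 74 and 126 figures are the v0.3.9 numbers quoted in the commit messages.

```go
package main

import "fmt"

// RunResult is a hypothetical per-run record.
type RunResult struct {
	Benchmark string
	Success   bool
}

// Summary holds statistics derived from results, never read from baseline.json.
type Summary struct {
	TotalRuns    int
	SuccessCount int
}

// SuccessRate returns the fraction of successful runs, or 0 when there are no runs.
func (s Summary) SuccessRate() float64 {
	if s.TotalRuns == 0 {
		return 0
	}
	return float64(s.SuccessCount) / float64(s.TotalRuns)
}

// summarize recomputes the counts from the result records on every call.
func summarize(results []RunResult) Summary {
	s := Summary{TotalRuns: len(results)}
	for _, r := range results {
		if r.Success {
			s.SuccessCount++
		}
	}
	return s
}

func main() {
	// 126 runs with 74 successes.
	results := make([]RunResult, 126)
	for i := 0; i < 74; i++ {
		results[i].Success = true
	}
	s := summarize(results)
	fmt.Printf("%d/%d = %.1f%%\n", s.SuccessCount, s.TotalRuns, 100*s.SuccessRate())
}
```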

Documentation & Dashboard

  • ✅ Updated CLAUDE.md with dashboard workflow warnings
  • ✅ Regenerated dashboard with v0.3.9 data (all 22 benchmarks)

Test Results

$ go test ./internal/eval_analysis/...
ok      github.com/sunholo/ailang/internal/eval_analysis    0.234s
coverage: 89.7% of statements

All 7 new tests passing:

  • TestHistoryPreservation
  • TestDuplicateVersionUpdate
  • TestMissingHistoryCreation
  • TestValidation
  • TestAtomicWrites
  • TestAtomicWritesValidationFailure
  • TestBuildHistoryEntry

Before & After

Before (v0.3.9):

$ ailang eval-report ... --format=json > latest.json
# Lost v0.3.8, v0.3.7-1, v0.3.6-24-mini from history
# History: 2 versions (v0.3.9, v0.3.9-alpha1)

After (v0.3.10):

$ ailang eval-report ... --format=json
# Preserves all history
# History: 5 versions (v0.3.9, v0.3.9-alpha1, v0.3.8, v0.3.7-1, v0.3.6-24-mini)

Verification

Tested with existing v0.3.9 baseline:

$ ailang eval-report eval_results/baselines/v0.3.9 v0.3.9 --format=json
Loading results from eval_results/baselines/v0.3.9...
Loaded existing dashboard with 5 history entries
Generating performance matrix...
Generating json report...
# ✅ History preserved: 5 → 5 versions
# ✅ All 22 benchmarks included
# ✅ No data loss

Files Changed

  • internal/eval_analysis/types.go - Core data structures
  • internal/eval_analysis/export_docusaurus.go - History preservation logic
  • internal/eval_analysis/export_docusaurus_test.go - Comprehensive tests
  • cmd/ailang/eval_tools.go - CLI integration
  • tools/eval_baseline.sh - Baseline creation fixes
  • Makefile - VERSION requirement enforcement
  • CLAUDE.md - Workflow documentation
  • docs/static/benchmarks/latest.json - Updated dashboard data
  • design_docs/20251016/M-DASH.md - Design documentation

Breaking Changes

None - backward compatible with existing baselines.

Next Steps

  1. Merge to dev
  2. Test with v0.3.10 baseline creation
  3. Monitor dashboard updates in production

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

MarkEdmondson1234 and others added 4 commits October 16, 2025 13:00
Problem: `ailang eval-report --format=json` destroyed historical data
by regenerating JSON from scratch, losing versions not found on disk.

Solution: Read existing dashboard → merge history → write atomically.

Changes:
- Load existing dashboard before writing (history preservation)
- Merge new version into history (update if exists, append if new)
- Atomic writes with validation (temp file + rename)
- Added DashboardJSON type with validation

Implementation:
- internal/eval_analysis/types.go: DashboardJSON + HistoryEntry types
- internal/eval_analysis/export_docusaurus.go: History preservation logic
- cmd/ailang/eval_tools.go: Pass output path to ExportBenchmarkJSON
- Tests: 7 new tests (100% coverage for new code)

Impact:
✅ Running eval-report twice preserves history (5 versions → 5 versions)
✅ Rerunning same version updates entry (no duplicates)
✅ Validation prevents corrupted JSON
✅ Atomic writes prevent partial writes

Milestones completed:
- ✅ M1: History preservation (4h)
- ✅ M2: Validation + atomic writes (2h)

Remaining:
- ⏳ M3: Baseline metadata fixes (tools/eval_baseline.sh)

See: design_docs/20251016/M-DASH.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Added detailed documentation for the new history-preserving dashboard
update workflow (M-DASH). Emphasizes that dashboard updates now happen
automatically via ailang eval-report --format=json without manual steps.

Key additions:
- ✅ Dashboard preserves history automatically
- ✅ Validation + atomic writes built-in
- ❌ Don't redirect stdout (bypasses preservation logic)
- ❌ Don't manually edit latest.json

This reinforces the eval-orchestrator agent's workflow and prevents
users/AI from trying to reinvent dashboard update scripts.

Related: M-DASH (design_docs/20251016/M-DASH.md)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Problem: baseline.json had wrong version (git describe) and cached
wrong success_count (20 vs actual 74 in v0.3.9).

Solution: Require explicit VERSION, separate git_describe, remove
cached success_count (calculate dynamically from result files).

Changes:
1. tools/eval_baseline.sh:
   - Require explicit VERSION env var (no git describe default)
   - Add git_describe as separate field
   - Remove success_count from baseline.json (calculated dynamically)
   - Improved error messages

2. Makefile:
   - eval-baseline now requires EVAL_VERSION parameter
   - Clear error message if missing

3. internal/eval_analysis/loader.go:
   - Already calculates success_count dynamically (lines 123-134)
   - No changes needed (already correct!)

4. CLAUDE.md:
   - Updated eval-baseline examples to show required parameter

Impact:
✅ VERSION must be explicit (e.g., EVAL_VERSION=v0.3.10)
✅ baseline.json has separate version and git_describe fields
✅ success_count always accurate (calculated from result files)
❌ No more "v0.3.7-46-g2cfa80a" version confusion
❌ No more cached wrong stats (20 vs 74)

Usage:
  make eval-baseline EVAL_VERSION=v0.3.10
  VERSION=v0.3.10 ./tools/eval_baseline.sh

Completes M-DASH milestone (all 3 parts done).

Related: design_docs/20251016/M-DASH.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Regenerated dashboard using new history-preserving workflow.

Changes:
- Now includes api_call_json and json_encode benchmarks
- Updated stats: 126 runs, 58.7% success rate
- History preserved (5 versions intact)
- All 22 benchmarks now in dashboard data

Generated with: ailang eval-report eval_results/baselines/v0.3.9 v0.3.9 --format=json

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MarkEdmondson1234 MarkEdmondson1234 merged commit 03d3e20 into dev Oct 16, 2025
7 checks passed
@MarkEdmondson1234 MarkEdmondson1234 deleted the fix/eval-dashboard-reliability branch October 16, 2025 11:22
sunholo-voight-kampff added a commit that referenced this pull request Mar 13, 2026
Root cause: lookupPrefix() iterated Go map nondeterministically when
duplicate namespace prefixes mapped to same URI (common in EPUB/OOXML).
Fix: check default namespace first before map iteration.
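
A sketch of the idea, under assumptions: the signature below is hypothetical, and the sorted tie-break for non-default prefixes goes beyond the fix described in this commit message, which only states that the default namespace is checked first. Go deliberately randomizes map iteration order, so returning the first match from a plain `range` loop can differ between runs.

```go
package main

import (
	"fmt"
	"sort"
)

// lookupPrefix returns a prefix bound to uri.
// The default namespace (empty prefix) is checked first, before any map iteration.
// Among the remaining candidates the lexicographically smallest prefix is returned,
// so the result does not depend on map iteration order.
func lookupPrefix(defaultNS string, prefixes map[string]string, uri string) (string, bool) {
	if defaultNS != "" && defaultNS == uri {
		return "", true
	}
	var matches []string
	for p, u := range prefixes {
		if u == uri {
			matches = append(matches, p)
		}
	}
	if len(matches) == 0 {
		return "", false
	}
	sort.Strings(matches)
	return matches[0], true
}

func main() {
	// Two prefixes bound to the same URI; the URI is a made-up placeholder.
	prefixes := map[string]string{
		"opf": "http://example.test/ns",
		"dc":  "http://example.test/ns",
	}
	p, ok := lookupPrefix("", prefixes, "http://example.test/ns")
	fmt.Println(p, ok)
}
```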

Performance: String() methods on ListValue, ArrayValue, TupleValue,
RecordValue, TaggedValue used += concatenation (O(n²)). Switched to
strings.Builder. Pre-allocated slices in evalCoreList/Array/Tuple and
XML attribute parsing. Zero-allocation whitespace check for CharData.
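
For context, the `+=` to `strings.Builder` change looks roughly like this. `ListValue` here is a simplified, hypothetical stand-in with string elements, and the `[a, b, c]` output format is an assumption rather than the interpreter's actual rendering.

```go
package main

import (
	"fmt"
	"strings"
)

// ListValue is a simplified stand-in for the interpreter's list value.
type ListValue struct {
	Elements []string
}

// String renders the list with a strings.Builder.
// Repeated `s += x` copies the whole accumulated string on each step (O(n²) overall);
// the builder appends into a growing buffer instead.
func (l ListValue) String() string {
	var b strings.Builder
	b.WriteByte('[')
	for i, e := range l.Elements {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteString(e)
	}
	b.WriteByte(']')
	return b.String()
}

func main() {
	fmt.Println(ListValue{Elements: []string{"1", "2", "3"}})
}
```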

Result: Moby Dick EPUB parse 62s → 11.5s (5.4x speedup).

Process: added determinism verification as sprint-executor principle #9
and builtin-developer validation rule #7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request Mar 24, 2026
SKILL.md:
- Rule #7: publish auto-rewrites path deps (new)
- Publishing checklist section (new)
- Registry validator section (new)
- Updated error table with duplicate key error

error_solutions.md:
- TOML duplicate key error (from ailang install bug)
- Publishing with path deps guidance
- Dependency order example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request May 8, 2026
… 10 integration gaps

Today's live smoke testing of v0.18.0's M-MOTOKO-EXECUTOR-ADAPTER
surfaced 10 interconnected gaps that prevent trustworthy benchmark
numbers. Three got partial fixes during the day (HealthCheck no-spawn,
MOTOKO_REPO fallback, MOTOKO_HEADLESS, run_summary-before-done reorder)
but root causes remain across both repos. User feedback: "we need it
all I think. lets get to the bottom of the gaps - I think a design
doc process will help."

This sprint sequences the fixes properly:

  Phase 1: Investigation-first for gap #1 (run_summary not reaching
    disk on success path) — debug:checkpoint markers + bisect.
    Non-negotiable; writing a fix without the cause is gambling.

  Phase 2: motoko-side fixes (gap #1 root-cause fix + #6 extension
    visibility + #7 --headless flag + #8 --version mode + #10 TS
    process.exit removal so emission ordering doesn't matter)

  Phase 3: AILANG-side fixes (gap #2 success-criteria fallback to
    thinking.finish_reason + #5 MOTOKO_REPO discovery from wrapper)

  Phase 4: Cross-cutting (gap #4 session_id unification — adapter
    canonical, TS wrapper honors, AILANG runtime emits matching)

  Phase 5: Config layer (gap #3 + #9 cost_rates source-of-truth in
    models.yml.pricing → env-var override of motoko's profile config)

  Phase 6: End-to-end validation — TestEndToEnd_FullResultPopulation
    asserts every Result field; M5 paired-comparison
    motoko-claude-haiku-4-5 vs claude-haiku-4-5 produces real numbers.

Architectural posture: eliminate fragile assumptions at every layer.
Today's adapter assumes things that aren't true (wrapper preserves
session_id, cost_rates configured, run_summary always reaches disk,
loaded_extensions field accurate). After this hardening, none of those
assumptions remain — each replaced with explicit observable contracts.

Net axiom score: +13 (no hard violations). Strong A2 (replayability —
captured runs are fully reproducible), A7 (machines first — Result
fields mechanically reliable), A9 (cost visibility — eliminates $0
reporting gap).

Estimated 3 working days, ~530 LOC including tests, across both repos.
GATING for M5 of v0.18.0 (threshold-measurement) and v0.19.0
M-MOTOKO-EXT-PER-TASK (which needs accurate session_ids + extension
visibility from this hardening).

Cross-references:
- v0.18.0 M-MOTOKO-EXECUTOR-ADAPTER Future Work updated to point at
  this hardening as the trustworthy-numbers prerequisite
- v0.19.0 M-MOTOKO-EXT-PER-TASK Dependencies updated to mark v0.18.1
  as BLOCKING (was just "after local validation")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request May 8, 2026
…design docs

Phase 6 of v0.18.1 hardening sprint.

Moves both design docs from design_docs/planned/v0_18_1/ to
design_docs/implemented/v0_18_1/ and updates their status headers to
"Implemented (2026-05-08)" with cross-repo commit references.

Adds the v0.18.1 entry to changelogs/v0.10-current.md covering all
five phases:
  - Phase 1 (gap #1): JSONL drain race in TS layer
  - Phase 2 (gaps #6, #7, #8): extensions visibility, --headless, --version
  - Phase 3 (gaps #2, #5): success fallback, MOTOKO_REPO discovery
  - Phase 4 (gap #4): session_id unification
  - Phase 5 (gaps #3, #9): cost rates env-var passthrough

Acceptance gate: 5 of 7 conditions met; the remaining 2 (CostUSD>0
end-to-end + smoke success) blocked on a separate Bedrock validation
issue (extension tool names with `/` fail Anthropic's
^[a-zA-Z0-9_-]{1,128}$ pattern). The pricing env-var plumbing is
verified by unit tests; live smoke needs the extension fix downstream.
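
The name constraint quoted above can be checked locally before a request is sent. This is a minimal sketch: the pattern is the one cited in this message, while `validToolName` and the sample tool names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolNamePattern is the pattern quoted in the commit message.
var toolNamePattern = regexp.MustCompile(`^[a-zA-Z0-9_-]{1,128}$`)

// validToolName reports whether name satisfies the pattern.
func validToolName(name string) bool {
	return toolNamePattern.MatchString(name)
}

func main() {
	// Hypothetical names: a "/" separator is rejected, an underscore is accepted.
	for _, name := range []string{"ext/read_file", "ext_read_file"} {
		fmt.Printf("%-15s valid=%v\n", name, validToolName(name))
	}
}
```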

LOC tally: ~80 AILANG-side + ~250 motoko-side + 11 new tests across
both repos, in ~6 hours wall-clock vs the 3-day plan estimate.

Sprint retrospective: investigation-first paid off — the 12 debug:
checkpoint markers in Phase 1 directly identified the silent-exit
point as the TS process.exit-on-done race, which would have been
maddening to find by code-reading alone. The resulting fix was tiny
(~25 LOC across 2 TS files) but unblocked everything downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request May 9, 2026
Arni's PR #6 review (with Opus 4.6's analysis) flagged that motoko_agent's
ailang.toml/ailang.lock had absolute /Users/mark/dev/... paths baked in,
making the lockfile non-portable and breaking any external clone.

The actual fix shipped on motoko-bisect-gap1 / PR #7 (commit f105af2):
swap path-based deps for registry versions — same packages, all already
published.

This commit adds two things to extension-packages.md so future readers
won't fall into the same trap:

1. A note immediately after the host ailang.toml example explaining when
   to use registry vs path — and warning that path is a dev-loop tool,
   not a release-ready format.

2. A new "Path vs registry checklist" section with concrete jq/ailang
   commands to verify the lockfile before opening a PR.

The example ailang.toml now uses fully-qualified registry refs
("sunholo/motoko_ext_abi" = "1.0.0") to match what users will actually
write — the previous bare-name form ("motoko-ext-abi") didn't include
the registry namespace.

Refs: PR arniwesth/motoko_agent#6 (review by arniwesth + Opus 4.6 analysis)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request May 9, 2026
The Typical usage block in std/ai.stepWithStream's docstring showed:

  let render = \chunk ->
    match chunk { ... }

which is doubly wrong:
  - AILANG lambda syntax is \x. body, not \x -> body (the latter is the
    type-arrow + match-arm syntax)
  - match-in-lambda hits a known parser bug (see
    design_docs/planned/v0_13_0/m-dx-match-in-hof-block-lambda.md)

Replaced with the top-level-func pattern that examples/runnable/
ai_streaming.ail actually uses. Tracks reality and demonstrates the
parser-bug workaround in one place.

Caught while wiring stepWithStream into motoko_agent
(arniwesth/motoko_agent PR #7) — the original syntax doesn't compile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>