M-DASH: Fix eval dashboard history preservation and reliability #7
Merged
Problem: `ailang eval-report --format=json` destroyed historical data by regenerating JSON from scratch, losing versions not found on disk.

Solution: Read existing dashboard → merge history → write atomically.

Changes:
- Load existing dashboard before writing (history preservation)
- Merge new version into history (update if exists, append if new)
- Atomic writes with validation (temp file + rename)
- Added DashboardJSON type with validation

Implementation:
- internal/eval_analysis/types.go: DashboardJSON + HistoryEntry types
- internal/eval_analysis/export_docusaurus.go: History preservation logic
- cmd/ailang/eval_tools.go: Pass output path to ExportBenchmarkJSON
- Tests: 7 new tests (100% coverage for new code)

Impact:
- ✅ Running eval-report twice preserves history (5 versions → 5 versions)
- ✅ Rerunning same version updates entry (no duplicates)
- ✅ Validation prevents corrupted JSON
- ✅ Atomic writes prevent partial writes

Milestones completed:
- ✅ M1: History preservation (4h)
- ✅ M2: Validation + atomic writes (2h)

Remaining:
- ⏳ M3: Baseline metadata fixes (tools/eval_baseline.sh)

See: design_docs/20251016/M-DASH.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
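The "update if exists, append if new" merge rule can be sketched in Go as follows. This is a minimal illustration, not the code in `export_docusaurus.go`: the `HistoryEntry` fields and the `mergeHistory` signature are assumptions, and the success rates in `main` are made-up sample values.

```go
package main

import "fmt"

// HistoryEntry is a hypothetical stand-in for the PR's HistoryEntry type;
// the real field set is not shown in this conversation.
type HistoryEntry struct {
	Version     string
	SuccessRate float64
}

// mergeHistory returns a copy of existing with entry merged in.
// An entry with the same Version is replaced in place (no duplicates);
// otherwise the new entry is appended at the end.
func mergeHistory(existing []HistoryEntry, entry HistoryEntry) []HistoryEntry {
	merged := make([]HistoryEntry, len(existing), len(existing)+1)
	copy(merged, existing)
	for i := range merged {
		if merged[i].Version == entry.Version {
			merged[i] = entry
			return merged
		}
	}
	return append(merged, entry)
}

func main() {
	// Illustrative values only.
	history := []HistoryEntry{
		{Version: "v0.3.8", SuccessRate: 0.50},
		{Version: "v0.3.9", SuccessRate: 0.55},
	}
	// Rerunning v0.3.9 updates the existing entry.
	history = mergeHistory(history, HistoryEntry{Version: "v0.3.9", SuccessRate: 0.58})
	// A new version is appended.
	history = mergeHistory(history, HistoryEntry{Version: "v0.3.10", SuccessRate: 0.60})
	for _, h := range history {
		fmt.Printf("%s %.2f\n", h.Version, h.SuccessRate)
	}
}
```

Copying the slice before merging keeps the loaded history untouched if a later validation step rejects the result.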
Added detailed documentation for the new history-preserving dashboard update workflow (M-DASH). Emphasizes that dashboard updates now happen automatically via `ailang eval-report --format=json` without manual steps.

Key additions:
- ✅ Dashboard preserves history automatically
- ✅ Validation + atomic writes built-in
- ❌ Don't redirect stdout (bypasses preservation logic)
- ❌ Don't manually edit latest.json

This reinforces the eval-orchestrator agent's workflow and prevents users/AI from trying to reinvent dashboard update scripts.

Related: M-DASH (design_docs/20251016/M-DASH.md)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem: baseline.json had wrong version (git describe) and cached wrong success_count (20 vs actual 74 in v0.3.9).

Solution: Require explicit VERSION, separate git_describe, remove cached success_count (calculate dynamically from result files).

Changes:
1. tools/eval_baseline.sh:
   - Require explicit VERSION env var (no git describe default)
   - Add git_describe as separate field
   - Remove success_count from baseline.json (calculated dynamically)
   - Improved error messages
2. Makefile:
   - eval-baseline now requires EVAL_VERSION parameter
   - Clear error message if missing
3. internal/eval_analysis/loader.go:
   - Already calculates success_count dynamically (lines 123-134)
   - No changes needed (already correct!)
4. CLAUDE.md:
   - Updated eval-baseline examples to show required parameter

Impact:
- ✅ VERSION must be explicit (e.g., EVAL_VERSION=v0.3.10)
- ✅ baseline.json has separate version and git_describe fields
- ✅ success_count always accurate (calculated from result files)
- ❌ No more "v0.3.7-46-g2cfa80a" version confusion
- ❌ No more cached wrong stats (20 vs 74)

Usage:
    make eval-baseline EVAL_VERSION=v0.3.10
    VERSION=v0.3.10 ./tools/eval_baseline.sh

Completes M-DASH milestone (all 3 parts done).

Related: design_docs/20251016/M-DASH.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
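Deriving `success_count` from the loaded results, rather than caching it in baseline.json, might look roughly like this in Go. The `RunResult` and `Summary` types and the `summarize` function are hypothetical names chosen for illustration; the actual logic in `internal/eval_analysis/loader.go` is not shown in this conversation.

```go
package main

import "fmt"

// RunResult is a hypothetical, trimmed-down view of one benchmark result file.
type RunResult struct {
	Benchmark string
	Success   bool
}

// Summary holds counts derived from results at load time, never cached on disk.
type Summary struct {
	Total        int
	SuccessCount int
}

// summarize counts successes directly from the results, so the number
// cannot drift from what is actually on disk.
func summarize(results []RunResult) Summary {
	s := Summary{Total: len(results)}
	for _, r := range results {
		if r.Success {
			s.SuccessCount++
		}
	}
	return s
}

// SuccessRate returns the fraction of successful runs, or 0 for an empty set.
func (s Summary) SuccessRate() float64 {
	if s.Total == 0 {
		return 0
	}
	return float64(s.SuccessCount) / float64(s.Total)
}

func main() {
	// Illustrative results only.
	results := []RunResult{
		{Benchmark: "api_call_json", Success: true},
		{Benchmark: "json_encode", Success: true},
		{Benchmark: "json_encode", Success: false},
	}
	s := summarize(results)
	fmt.Printf("%d/%d succeeded (%.1f%%)\n", s.SuccessCount, s.Total, s.SuccessRate()*100)
}
```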
Regenerated dashboard using new history-preserving workflow.

Changes:
- Now includes api_call_json and json_encode benchmarks
- Updated stats: 126 runs, 58.7% success rate
- History preserved (5 versions intact)
- All 22 benchmarks now in dashboard data

Generated with:
    ailang eval-report eval_results/baselines/v0.3.9 v0.3.9 --format=json

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request on Mar 13, 2026

Root cause: lookupPrefix() iterated Go map nondeterministically when duplicate namespace prefixes mapped to same URI (common in EPUB/OOXML).

Fix: check default namespace first before map iteration.

Performance: String() methods on ListValue, ArrayValue, TupleValue, RecordValue, TaggedValue used += concatenation (O(n²)). Switched to strings.Builder. Pre-allocated slices in evalCoreList/Array/Tuple and XML attribute parsing. Zero-allocation whitespace check for CharData.

Result: Moby Dick EPUB parse 62s → 11.5s (5.4x speedup).

Process: added determinism verification as sprint-executor principle #9 and builtin-developer validation rule #7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
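The determinism fix described in that commit can be illustrated with a small Go sketch. The signature and the `map[string]string` representation are assumptions; only the "check the default namespace first" step comes from the commit message, and the sorted-key fallback is an extra assumption added here so the sketch is deterministic in every case.

```go
package main

import (
	"fmt"
	"sort"
)

// lookupPrefix returns a prefix bound to uri. prefixes maps prefix -> URI,
// with "" standing for the default namespace (a convention assumed here).
func lookupPrefix(prefixes map[string]string, uri string) (string, bool) {
	// Check the default namespace first, before any map iteration.
	if u, ok := prefixes[""]; ok && u == uri {
		return "", true
	}
	// Go randomizes map iteration order, so walk the keys in sorted order
	// to pick the same prefix on every run when several map to one URI.
	keys := make([]string, 0, len(prefixes))
	for p := range prefixes {
		keys = append(keys, p)
	}
	sort.Strings(keys)
	for _, p := range keys {
		if prefixes[p] == uri {
			return p, true
		}
	}
	return "", false
}

func main() {
	// Illustrative bindings: two prefixes plus the default share one URI.
	prefixes := map[string]string{
		"":     "http://www.idpf.org/2007/opf",
		"opf":  "http://www.idpf.org/2007/opf",
		"opf2": "http://www.idpf.org/2007/opf",
	}
	p, ok := lookupPrefix(prefixes, "http://www.idpf.org/2007/opf")
	fmt.Printf("prefix=%q found=%v\n", p, ok)
}
```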
sunholo-voight-kampff added a commit that referenced this pull request on Mar 24, 2026

SKILL.md:
- Rule #7: publish auto-rewrites path deps (new)
- Publishing checklist section (new)
- Registry validator section (new)
- Updated error table with duplicate key error

error_solutions.md:
- TOML duplicate key error (from ailang install bug)
- Publishing with path deps guidance
- Dependency order example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request on May 8, 2026

… 10 integration gaps

Today's live smoke testing of v0.18.0's M-MOTOKO-EXECUTOR-ADAPTER surfaced 10 interconnected gaps that prevent trustworthy benchmark numbers. Three got partial fixes during the day (HealthCheck no-spawn, MOTOKO_REPO fallback, MOTOKO_HEADLESS, run_summary-before-done reorder) but root causes remain across both repos.

User feedback: "we need it all I think. lets get to the bottom of the gaps - I think a design doc process will help."

This sprint sequences the fixes properly:
- Phase 1: Investigation-first for gap #1 (run_summary not reaching disk on success path): debug:checkpoint markers + bisect. Non-negotiable; writing a fix without the cause is gambling.
- Phase 2: motoko-side fixes (gap #1 root-cause fix + #6 extension visibility + #7 --headless flag + #8 --version mode + #10 TS process.exit removal so emission ordering doesn't matter)
- Phase 3: AILANG-side fixes (gap #2 success-criteria fallback to thinking.finish_reason + #5 MOTOKO_REPO discovery from wrapper)
- Phase 4: Cross-cutting (gap #4 session_id unification: adapter canonical, TS wrapper honors, AILANG runtime emits matching)
- Phase 5: Config layer (gap #3 + #9 cost_rates source-of-truth in models.yml.pricing → env-var override of motoko's profile config)
- Phase 6: End-to-end validation: TestEndToEnd_FullResultPopulation asserts every Result field; M5 paired-comparison motoko-claude-haiku-4-5 vs claude-haiku-4-5 produces real numbers.

Architectural posture: eliminate fragile assumptions at every layer. Today's adapter assumes things that aren't true (wrapper preserves session_id, cost_rates configured, run_summary always reaches disk, loaded_extensions field accurate). After this hardening, none of those assumptions remain; each is replaced with explicit observable contracts.

Net axiom score: +13 (no hard violations). Strong A2 (replayability: captured runs are fully reproducible), A7 (machines first: Result fields mechanically reliable), A9 (cost visibility: eliminates $0 reporting gap).

Estimated 3 working days, ~530 LOC including tests, across both repos. GATING for M5 of v0.18.0 (threshold-measurement) and v0.19.0 M-MOTOKO-EXT-PER-TASK (which needs accurate session_ids + extension visibility from this hardening).

Cross-references:
- v0.18.0 M-MOTOKO-EXECUTOR-ADAPTER Future Work updated to point at this hardening as the trustworthy-numbers prerequisite
- v0.19.0 M-MOTOKO-EXT-PER-TASK Dependencies updated to mark v0.18.1 as BLOCKING (was just "after local validation")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request on May 8, 2026

…design docs

Phase 6 of v0.18.1 hardening sprint. Moves both design docs from design_docs/planned/v0_18_1/ to design_docs/implemented/v0_18_1/ and updates their status headers to "Implemented (2026-05-08)" with cross-repo commit references.

Adds the v0.18.1 entry to changelogs/v0.10-current.md covering all five phases:
- Phase 1 (gap #1): JSONL drain race in TS layer
- Phase 2 (gaps #6, #7, #8): extensions visibility, --headless, --version
- Phase 3 (gaps #2, #5): success fallback, MOTOKO_REPO discovery
- Phase 4 (gap #4): session_id unification
- Phase 5 (gaps #3, #9): cost rates env-var passthrough

Acceptance gate: 5 of 7 conditions met; the remaining 2 (CostUSD>0 end-to-end + smoke success) blocked on a separate Bedrock validation issue (extension tool names with `/` fail Anthropic's ^[a-zA-Z0-9_-]{1,128}$ pattern). The pricing env-var plumbing is verified by unit tests; live smoke needs the extension fix downstream.

LOC tally: ~80 AILANG-side + ~250 motoko-side + 11 new tests across both repos, in ~6 hours wall-clock vs the 3-day plan estimate.

Sprint retrospective: investigation-first paid off. The 12 debug:checkpoint markers in Phase 1 directly identified the silent-exit point as the TS process.exit-on-done race, which would have been maddening to find by code-reading alone. The resulting fix was tiny (~25 LOC across 2 TS files) but unblocked everything downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request on May 9, 2026

Arni's PR #6 review (with Opus 4.6's analysis) flagged that motoko_agent's ailang.toml/ailang.lock had absolute /Users/mark/dev/... paths baked in, making the lockfile non-portable and breaking any external clone.

The actual fix shipped on motoko-bisect-gap1 / PR #7 (commit f105af2): swap path-based deps for registry versions (same packages, all already published).

This commit adds two things to extension-packages.md so future readers won't fall into the same trap:
1. A note immediately after the host ailang.toml example explaining when to use registry vs path, and warning that path is a dev-loop tool, not a release-ready format.
2. A new "Path vs registry checklist" section with concrete jq/ailang commands to verify the lockfile before opening a PR.

The example ailang.toml now uses fully-qualified registry refs ("sunholo/motoko_ext_abi" = "1.0.0") to match what users will actually write; the previous bare-name form ("motoko-ext-abi") didn't include the registry namespace.

Refs: PR arniwesth/motoko_agent#6 (review by arniwesth + Opus 4.6 analysis)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sunholo-voight-kampff added a commit that referenced this pull request on May 9, 2026

The Typical usage block in std/ai.stepWithStream's docstring showed:
let render = \chunk ->
match chunk { ... }
which is doubly wrong:
- AILANG lambda syntax is \x. body, not \x -> body (the latter is the
type-arrow + match-arm syntax)
- match-in-lambda hits a known parser bug (see
design_docs/planned/v0_13_0/m-dx-match-in-hof-block-lambda.md)
Replaced with the top-level-func pattern that examples/runnable/
ai_streaming.ail actually uses. Tracks reality and demonstrates the
parser-bug workaround in one place.
Caught while wiring stepWithStream into motoko_agent
(arniwesth/motoko_agent PR #7) — the original syntax doesn't compile.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Fixes the eval dashboard reliability issues documented in `design_docs/planned/eval-dashboard-reliability.md`.

Problem: Running `ailang eval-report --format=json` was destroying historical dashboard data because it regenerated history from scratch by scanning disk, losing deleted baselines.

Solution: Implemented read-modify-write pattern with history preservation, validation, and atomic writes.
Changes
M1: History Preservation (4h)
- `DashboardJSON` and `HistoryEntry` types with validation
- `loadExistingDashboard()` to read current JSON
- `mergeHistory()` with duplicate detection
- Updated `ExportBenchmarkJSON()` to preserve history

M2: Validation & Atomic Writes (2h)
- `DashboardJSON.Validate()` method
- `writeJSONAtomic()` with temp file pattern

M3: Baseline Metadata Fixes (1h)
- Updated `tools/eval_baseline.sh` to require explicit VERSION
- Separated `version` from `git_describe` in baseline.json
- Removed cached `success_count` (calculated dynamically)

Documentation & Dashboard
Test Results
    $ go test ./internal/eval_analysis/...
    ok  github.com/sunholo/ailang/internal/eval_analysis  0.234s  coverage: 89.7% of statements

All 7 new tests passing:
Before & After
Before (v0.3.9):
After (v0.3.10):
Verification
Tested with existing v0.3.9 baseline:
Files Changed
- `internal/eval_analysis/types.go` - Core data structures
- `internal/eval_analysis/export_docusaurus.go` - History preservation logic
- `internal/eval_analysis/export_docusaurus_test.go` - Comprehensive tests
- `cmd/ailang/eval_tools.go` - CLI integration
- `tools/eval_baseline.sh` - Baseline creation fixes
- `Makefile` - VERSION requirement enforcement
- `CLAUDE.md` - Workflow documentation
- `docs/static/benchmarks/latest.json` - Updated dashboard data
- `design_docs/20251016/M-DASH.md` - Design documentation

Breaking Changes
None - backward compatible with existing baselines.
Next Steps
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>