Tracking planned work for ArkSim. Items are roughly in priority order.
Next
CI/CD support
Per-metric score thresholds - Configure pass/fail thresholds for individual metrics (e.g. faithfulness >= 0.8, goal_completion >= 0.9) so pipelines can gate on specific quality criteria.
GitHub Actions - Reusable action to run arksim simulate-evaluate as a pipeline step.
Config validation - Catch invalid configs and missing environment variables early with clear, actionable error messages instead of cryptic failures mid-run.
Custom metrics improvements - Better error reporting for custom metric failures and improved visualization for qualitative metrics.
UI improvements - Persist scenarios, simulations, and evaluation results across sessions. Better scenario management and result browsing.
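As a rough illustration of the per-metric threshold idea above, here is a minimal sketch in Python. It is not ArkSim's implementation or API: the function name `check_thresholds` and the plain-dict shape of the scores and thresholds are assumptions made for this example only, and the metric names are taken from the examples in the item itself.

```python
def check_thresholds(scores, thresholds):
    """Compare metric scores against minimum thresholds.

    Hypothetical helper, not part of ArkSim. Both arguments are plain
    dicts mapping metric name -> float. Returns a list of human-readable
    failure messages; an empty list means every threshold was met.
    """
    failures = []
    for metric, minimum in thresholds.items():
        if metric not in scores:
            # A missing score is treated as a failure so a gate cannot
            # pass silently when a metric was never computed.
            failures.append(f"{metric}: no score reported (required >= {minimum:.2f})")
        elif scores[metric] < minimum:
            failures.append(f"{metric}: {scores[metric]:.2f} < {minimum:.2f}")
    return failures


def exit_code(scores, thresholds):
    """Return 0 if all thresholds pass, 1 otherwise (suitable for a CI step)."""
    return 1 if check_thresholds(scores, thresholds) else 0
```

A pipeline step could then finish with `sys.exit(exit_code(scores, thresholds))`, so the job fails only when a specific metric falls below its configured minimum.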
Later
Tool call evaluation - Evaluate tool call accuracy by reading tool call responses and validating them against expected behavior.
Agentic simulation engine - Multi-knowledge simulation for richer, more realistic scenarios. Agents that pull from multiple knowledge sources across conversation turns.
Streaming support - Handle agents that stream responses (SSE / chunked transfer), so long-running agents don't time out during simulation.
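For context on the streaming item above, here is a simplified sketch of how an SSE stream can be reassembled into complete events on the client side. It follows the general Server-Sent Events line format (`data:` fields, blank line as event terminator, `:` for comments) but ignores `event:`, `id:`, and `retry:` fields. The function name `iter_sse_data` is hypothetical and this says nothing about how ArkSim will actually implement streaming.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event in an SSE line stream.

    Hypothetical helper for illustration. `lines` is any iterable of
    text lines, for example the decoded lines of an HTTP response body.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Comment line, often used by servers as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            # A single leading space after the colon is not part of the value.
            value = value[1:]
        if field == "data":
            buffer.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

Reading events incrementally like this lets a client reset its idle timer on every chunk or keep-alive comment instead of waiting for the full response, which is the timeout problem the item describes.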
Have a feature request? Comment below or open an issue.