Introduction: The Problem
In a recent engagement, our team faced a complex challenge. Setting up a machine learning experiment required a long sequence of manual steps. None were individually hard, but together they created a fragile, error-prone workflow. For example:

- Users needed to provide key metadata (experiment name, hypothesis, ground-truth dataset), but the format wasn't standardized, so people often filled it in inconsistently.
- They had to decide whether to reuse an existing baseline or create a new one, a choice that wasn't always obvious to newcomers.
- Multiple configuration files had to be manually updated for inference, evaluation, and the experiment runner, each with a slightly different format.
- Retrieval parameters were configured through custom retrieval codes, which required knowing the right flags.
- Prompts had to be written manually, without examples or validation.
- Finally, multiple scripts had to be executed in the right sequence to provision agents and launch the experiment.
Beyond these standard settings, specific experiments often introduce special cases; for instance, you might need to vary the inference model for each permutation you intend to run.
We wanted to make it quick, easy, and reliable to set up an experiment. As you can see from the list above, this setup was complex enough that users were often doing it incorrectly.
The Journey: Our Approach and Solution
We created a specialized agent (a feature in GitHub Copilot previously called "chat mode") to help users set up experiments. The agent would ask the user questions to gather the necessary information, then generate the configuration files and scripts needed to run the experiment. The user still needed to review the generated files.
Through iterative development and testing, we discovered several best practices that significantly improved the agent’s effectiveness:
- The first section of the prompt was a list of questions to ask the user if they had not already provided the information. The agent was instructed to ask all questions at once before proceeding.
- The second section was a list of instructions for copying and generating the files.
- The third section was a checklist that the agent would create in a tracking file, validating each setup step and then marking it complete. This second-stage validation significantly improved task adherence.
- The fourth section was an FAQ so that, if the user asked questions about the process, the agent could answer them. For instance, a user unsure whether they needed to run a new baseline could ask the agent and get guidance.
- We tested all the major models and found that Claude Sonnet 4 (a model from Anthropic) worked best for this task. Later versions of Claude also worked, but not quite as well; models from OpenAI and others rarely worked. We pinned the model in the agent to ensure consistency.
- We developed command-line tools that the agent could use to provision the agents, retrieval codes, and so on. These tools reduced the amount of code the agent needed to write and improved reliability.
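To give a feel for the kind of command-line surface involved, here is a minimal sketch of what a provisioning tool might look like. The tool name, flags, and agent-naming scheme are all hypothetical, not our actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI surface for an agent-callable provisioning tool.
    parser = argparse.ArgumentParser(description="Provision agents for an experiment")
    parser.add_argument("--experiment", required=True,
                        help="EXPERIMENT_NAME, e.g. retrieval_threshold_test")
    parser.add_argument("--permutations", type=int, default=1,
                        help="number of permutations to provision")
    parser.add_argument("--dry-run", action="store_true",
                        help="print planned actions without provisioning")
    return parser

def plan_provisioning(experiment: str, permutations: int) -> list[str]:
    # One agent per permutation; deterministic names keep files and logs organized.
    return [f"{experiment}_perm{i:02d}" for i in range(1, permutations + 1)]

if __name__ == "__main__":
    args = build_parser().parse_args()
    for agent in plan_provisioning(args.experiment, args.permutations):
        print(("DRY RUN: " if args.dry_run else "provisioning ") + agent)
```

A narrow, predictable interface like this is what made the tools easier for the agent to call correctly than free-form generated code.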
Section 1: Questions to Ask the User (EXAMPLE)
Collect the following information from the user if they haven’t provided it already:
a. EXPERIMENT_NAME: A unique identifier (e.g., "retrieval_threshold_test") so generated files, metrics, and logs stay organized.
b. HYPOTHESIS: A short statement about what the user expects to learn from the experiment.
c. SPRINT: The current sprint folder for their ground truth (e.g., "sprint01", "sprint02", etc.).
d. GENERATION_METRICS: Whether or not the user wants to collect generation metrics during the experiment. Early sprints and experiments that are focused on retrieval quality may not need generation metrics collected.
e. PERMUTATIONS: Each permutation will ultimately be a prompt file. If the experiment varies a retrieval parameter, that parameter determines the number of permutations (e.g., varying the threshold from 1 to 4 in increments of 0.5 means 7 permutations). If the experiment varies the prompts themselves, ask the user how many permutations they want to start with.
f. BASELINE: If the user says this is a baseline experiment, treat it as a baseline; otherwise, don't ask and assume it is a standard experiment.
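For parameter sweeps like the threshold example under PERMUTATIONS, the permutation count follows directly from the range and step. A minimal sketch (the function name is ours, for illustration only):

```python
def permutation_values(start: float, stop: float, step: float) -> list[float]:
    """Inclusive sweep values: 1 to 4 in steps of 0.5 yields 7 permutations."""
    count = int(round((stop - start) / step)) + 1
    # Round to suppress float drift (1 + 5 * 0.5 should be exactly 3.5).
    return [round(start + i * step, 10) for i in range(count)]
```

Having the agent compute this explicitly, rather than asking the user to count, removed one common source of mismatch between the stated range and the generated prompt files.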
Section 2: Instructions for Generating Files (EXAMPLE)
- Create a new experiment directory within the project structure using the EXPERIMENT_NAME the user provided. Make it a subdirectory, using underscores between the words in the name (e.g., `my_experiment`).
- Verify that you have created the experiment directory, and always use that directory for all subsequent files and scripts related to the experiment.
- Copy the `template.md` file to the new experiment directory as `README.md` and fill out as much as you can based on the information the user has already provided.
- Write the hypothesis in the `README.md` file in a section titled "Hypothesis".
…and so on…
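The directory and README steps above can be sketched in a few lines of Python. `scaffold_experiment`, its arguments, and the empty-template fallback are assumptions for illustration, not the agent's actual instructions:

```python
from pathlib import Path

def scaffold_experiment(root: Path, experiment_name: str,
                        hypothesis: str, template: Path) -> Path:
    # Underscore-separated directory name, e.g. "My Experiment" -> "my_experiment".
    exp_dir = root / "_".join(experiment_name.lower().split())
    exp_dir.mkdir(parents=True, exist_ok=True)
    # Start from template.md when it exists; otherwise begin empty.
    base = template.read_text() if template.exists() else ""
    # Append the hypothesis under its own section, per the instructions above.
    (exp_dir / "README.md").write_text(base + f"\n## Hypothesis\n\n{hypothesis}\n")
    return exp_dir
```

Spelling the steps out this concretely in the prompt is what let the agent produce the same directory layout every time.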
Section 3: Checklist for Validation (EXAMPLE)
- [ ] EXPERIMENT_NAME collected as a descriptive name
- [ ] HYPOTHESIS collected as a short statement
- [ ] SPRINT collected (e.g., "sprint01", "sprint02", etc.)
- [ ] GENERATION_METRICS decision collected (true/false)
- [ ] PERMUTATIONS collected and documented
- [ ] Experiment directory created with a name using underscores (e.g., `my_experiment`)
- [ ] Directory path is correct: `/experiments/<experiment_folder>/`
- [ ] Verified that the experiment directory exists and is being used for all subsequent files
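A checklist in this form is also easy to validate mechanically. A small sketch of how the tracking file could be scanned for unfinished steps (the parsing convention is an assumption):

```python
import re

def incomplete_items(checklist_md: str) -> list[str]:
    # "- [ ] item" is pending; "- [x] item" (or "[X]") counts as done.
    pending = []
    for line in checklist_md.splitlines():
        m = re.match(r"-\s*\[([ xX])\]\s*(.+)", line.strip())
        if m and m.group(1) == " ":
            pending.append(m.group(2))
    return pending
```

In our setup the agent itself performed this second-stage check against the tracking file, but the same pass could run as a plain script before launching the experiment.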
Section 4: FAQ (EXAMPLE)
The user may ask follow-up questions after you have created the experiment files. Answer them as best you can using this FAQ:
- Q: How do I get started?
  A: Just describe the experiment you want to create. Here's an example: "create me an experiment that varies threshold between 1 and 4 by increments of 0.5".
- Q: What do I do now?
  A: Run the `add_codes.py` script to add any codes to the retrieval service. Then run the `provision.py` script to provision the agents for the experiment. Finally, run the `evaluate.py` script to run the experiment and collect results.
…and so on…
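The three-script sequence in the answer above can be wrapped so the order is always enforced. This runner is a sketch that assumes only the script names from the FAQ; the injectable `run` parameter exists purely to make the sketch testable:

```python
import subprocess
import sys

# Script names come from the FAQ above; the order matters.
PIPELINE = ["add_codes.py", "provision.py", "evaluate.py"]

def run_pipeline(scripts=PIPELINE, run=None):
    # check=True stops at the first failure, so later steps never
    # execute against a half-provisioned experiment.
    run = run or (lambda s: subprocess.run([sys.executable, s], check=True))
    for script in scripts:
        run(script)
```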
Opportunities for Improvement
This approach worked well enough, but it wasn’t perfect. Here are some of the issues we encountered:
- We needed to simplify the experiment process. There were so many options and configurations that it was difficult for anyone, even an agent, to keep track of them all. One option might be to keep all configurations in Azure App Configuration (a Microsoft product) and have the agent use only labels/tags to indicate which configuration is relevant.
- Even though we pinned Claude Sonnet 4 as the model for the agent, users often chose to change it, and that would almost always cause issues. Perhaps there is a way to ensure the solution can only run with a specific model.
- Users did not always describe what they were trying to do effectively. The agent would write the configuration files and scripts based on its understanding, which was not always correct. Some of this inaccuracy was just the user lacking specificity, but in some cases the user didn't really understand what they were trying to do. One way to address this is for the agent to summarize its understanding and ask follow-up questions before generating files.
- It wasn't always clear to the user what was produced once there were dozens of files. A summary of what was generated and the options selected would be helpful. For example, without looking at code, it would be nice to know which model was used for inference and evaluation, which retrieval codes were used, and so on. Perhaps a file could be generated per permutation containing all the relevant information for that permutation.
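That last idea, a per-permutation summary file, could look something like the sketch below; the file layout and field names are hypothetical:

```python
import json
from pathlib import Path

def write_permutation_summary(exp_dir: Path, name: str, settings: dict) -> Path:
    # One JSON file per permutation, so a reviewer can see the chosen
    # models, retrieval codes, and parameters without reading any code.
    path = exp_dir / f"{name}_summary.json"
    path.write_text(json.dumps({"permutation": name, **settings}, indent=2))
    return path
```

Because the agent already knows every option it selected at generation time, emitting such a file would cost almost nothing and would answer most "what did I just get?" questions.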
Conclusion
Using an agent to automate experiment setup transformed our workflow. Speed and reliability improved dramatically. While challenges remain, particularly around model selection and user communication, this approach demonstrates how agents can handle complex, multi-step configuration tasks effectively.
Attribution
The image used in this post was created using ChatGPT 5.2 (a model from OpenAI, a Microsoft partner).