<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Forest Friends Zine]]></title><description><![CDATA[A guide for AI Engineers building the wild world of LLM system evals]]></description><link>https://forestfriends.tech/</link><image><url>https://forestfriends.tech/favicon.png</url><title>Forest Friends Zine</title><link>https://forestfriends.tech/</link></image><generator>Ghost 5.87</generator><lastBuildDate>Sat, 11 Apr 2026 11:35:58 GMT</lastBuildDate><atom:link href="https://forestfriends.tech/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Guidelines for consistent grading in LLM evals]]></title><description><![CDATA[<p>When starting off with eval, you might start with humans. It gives you experience with edge cases, and it hones your intuition of what good actually looks like. But once you do that, you need to scale up a little, which means writing guidelines on what good looks like.</p><p>This</p>]]></description><link>https://forestfriends.tech/guidelines-for-consistent-grading-in-llm-evals/</link><guid isPermaLink="false">668f08ee584a8b0001385ffb</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 10 Sep 2024 16:05:09 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/09/iamwil__httpss.mj.run69OEdLQJVZQ_A_bear_and_fox_are_talking_in__b85407e6-cfc0-4ff1-a5f3-039718b7cbb4.png" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/09/iamwil__httpss.mj.run69OEdLQJVZQ_A_bear_and_fox_are_talking_in__b85407e6-cfc0-4ff1-a5f3-039718b7cbb4.png" alt="Guidelines for consistent grading in LLM evals"><p>When starting off with eval, you might start with humans. 
It gives you experience with edge cases, and it hones your intuition of what good actually looks like. But once you&apos;ve done that, you need to scale up a little, which means writing guidelines on what good looks like.</p><p>This isn&apos;t time wasted, because the work you do writing guidelines for human evaluators will be similar, if not identical, to the prompt you would need to give to LLMs-as-a-judge. So do the reps and don&apos;t skip this step.</p><p>You want to be specific and prescriptive in your guidelines. The idea is to codify your intuition of what good looks like for other people so they can replicate your process and judgment. Be specific without being overly prescriptive; this can be a hard balance to strike.</p><p>First, explain the context of the task. What is the user trying to accomplish? What do they expect from the system? What is the context in which they&apos;re trying to accomplish this task?</p><p>Then, show the grader examples of responses and how those responses would be graded. The examples should cover both typical responses and edge cases. However, there&apos;s no need to enumerate every edge case. Just three to five examples will do. Neither humans nor LLMs can hold that much in their proverbial heads, despite the lengthening context window.</p><p>There are sections of a prompt context that the LLM will forget, leaving it unable to find a piece of information. So do what you can to keep things clear, succinct, and to the point. For more, check out &quot;needle in the haystack&quot; experiments.</p><p>Then finally, present the grading scale and what it means. It&apos;s better to use a binary grade. Better yet, compare two samples, and let the grader choose which one is better.</p><p>You should also give a checklist of things to watch out for. 
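</p><p>As a sketch, guidelines like these can be folded directly into an LLM-as-a-judge prompt. Everything in this example (the task context, rubric items, and few-shot examples) is hypothetical, not from the zine:</p>

```python
# Sketch: turning grading guidelines into an LLM-as-a-judge prompt.
# The task context, checklist items, and examples below are hypothetical.

GUIDELINES = """\
Task context: The user asks our cafe assistant for a recipe; they expect
a recipe that satisfies their stated constraints (e.g., allergies).

Checklist (things to watch out for):
- Does the recipe respect every stated dietary constraint?
- Are all listed ingredients actually used in the steps?
- Is the tone friendly but not long-winded?

Grade PASS or FAIL only."""

# Three to five graded examples: typical cases plus an edge case.
EXAMPLES = [
    ("Nut-free cookie recipe with no nuts listed", "PASS"),
    ("Vegan stew that includes butter", "FAIL"),
    ("Gluten-free bread using rye flour (edge case: rye has gluten)", "FAIL"),
]

def build_judge_prompt(candidate_output: str) -> str:
    """Assemble guidelines, few-shot examples, and the output to grade."""
    shots = "\n".join(f"Response: {r}\nGrade: {g}" for r, g in EXAMPLES)
    return f"{GUIDELINES}\n\n{shots}\n\nResponse: {candidate_output}\nGrade:"
```

The same text doubles as instructions for human graders, which is why writing it is never wasted work.<p>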
This will help human and LLM graders be more consistent about how they grade a particular output.</p><p>Some thought should be paid to the grading scale when systematizing your LLM evals, regardless of whether they&apos;re for humans or for LLM-as-a-judge. A bad grading scale can skew your ability to assess your output accurately. Humans do best if they have binary choices or have two samples to compare with each other.</p><p>And after you&apos;ve deployed to either human or LLM graders, you&apos;re not done. It&apos;s now time to iterate. Keep an eye on the output, and see what kind of bugs and complaints come in. Then keep adjusting your eval and see how the metrics you&apos;ve put in place improve.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Intrigued about evals? We&apos;re publishing a zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Click here to preview and pre-order</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Aligning LLMs-as-a-judge]]></title><description><![CDATA[<p>LLMs-as-a-judge sounds counterintuitive. How can an LLM judge its own work? If it could, wouldn&apos;t it just do better in the first place? Remember, LLMs are like stochastic functions out of the box. 
They don&apos;t have any memory outside of what you provide in their context.</p><p>And</p>]]></description><link>https://forestfriends.tech/aligning-llms-as-a-judge/</link><guid isPermaLink="false">6691a279584a8b00013860dd</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 03 Sep 2024 16:00:52 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/09/iamwil__A_bear_and_fox_are_standing_behind_a_counter_cooking_in_aede8412-0504-47a7-8765-2a157a8271f4-copy.png" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/09/iamwil__A_bear_and_fox_are_standing_behind_a_counter_cooking_in_aede8412-0504-47a7-8765-2a157a8271f4-copy.png" alt="Aligning LLMs-as-a-judge"><p>LLMs-as-a-judge sounds counterintuitive. How can an LLM judge its own work? If it could, wouldn&apos;t it just do better in the first place? Remember, LLMs are like stochastic functions out of the box. They don&apos;t have any memory outside of what you provide in their context.</p><p>And after all, a human can judge the work of another human, if that&apos;s easier for you to think about. Before we use LLMs-as-a-judge, we have to align the judge with human judgments of the output in this particular domain. The raw output of an LLM is a summary of the crush of humanity.</p><p>You start like you do with all Machine Learning: the unsexy collecting and cleaning of data. You want to collect a number of samples of what &quot;good&quot; looks like and cover a lot of the edge cases too. The good news is that you don&apos;t need to collect nearly as much as you used to.</p><p>But you still need to do enough. This is why it pays to be systematic and organized about your manual vibes-based judgments. If you haven&apos;t started doing that, you&apos;ll need to start before you can hand the job off to an LLM. 
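</p><p>Concretely, those organized manual judgments become a small dataset you can score an LLM judge against. A sketch with illustrative labels (the sample data here is made up; only the above-80% agreement target comes from this post):</p>

```python
# Sketch: a golden dataset of human pass/fail labels, plus agreement rate
# and a 2x2 confusion matrix against an LLM judge's labels.
# All sample data below is illustrative.
from collections import Counter

golden = [  # (sample_id, human_label)
    ("s1", "pass"), ("s2", "fail"), ("s3", "pass"), ("s4", "pass"), ("s5", "fail"),
]
llm_labels = {"s1": "pass", "s2": "pass", "s3": "pass", "s4": "pass", "s5": "fail"}

def agreement(golden, llm_labels):
    """Fraction of samples where the LLM judge matches the human label."""
    hits = sum(1 for sid, human in golden if llm_labels[sid] == human)
    return hits / len(golden)

def confusion(golden, llm_labels):
    """Counts keyed by (human_label, llm_label); off-diagonal = disagreement."""
    return Counter((human, llm_labels[sid]) for sid, human in golden)
```

Iterating the judge prompt then amounts to re-running these two functions and watching the off-diagonal counts shrink.<p>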
This is your golden dataset, the source of truth.</p><p>This golden dataset should at least cover the most common cases and some edge cases that you care about. Pay attention at this stage to any biases in your dataset that you want to avoid.</p><p>Then, you want to run the examples through your LLM eval and compute how closely the LLM eval&apos;s labels correlate with the human labels in the golden dataset.</p><p>Lastly, create a confusion matrix to see where the LLM and the human disagree. If there&apos;s complete agreement, then there&apos;d only be numbers on the diagonal. Then you simply iterate your prompt to drive down confusion. It usually only takes a few iterations, empirically. Shoot for above 80% agreement. The whole idea is to have an LLM that would judge the same as a human user. </p><p>Would you use an LLM to eval the evaluator? Is it turtles all the way down? Usually no. Because the evaluation of an eval is usually a yes/no answer or a selection between choices, it&apos;s easier to judge, so we opt for statistical metrics and code to judge the output of the evaluator!</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Intrigued about evals? We&apos;re publishing a zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Click here to preview and pre-order</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Did we help you build system evals?]]></title><description><![CDATA[<p>Hello!</p><p>Sri and I want to thank you for purchasing the <a href="https://issue1.forestfriends.tech/?ref=forestfriends.tech" rel="noreferrer">first issue of Forest Friends: <em>LLM system evals in the wild</em></a>.</p><p>In this first week since launch, we&apos;ve hit our sales goal! And hopefully, you&apos;ve had a chance to look through the issue. 
</p><p>If you&</p>]]></description><link>https://forestfriends.tech/did-we-help-you-build-system-evals/</link><guid isPermaLink="false">66d408ef07233300011dc3e6</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Mon, 02 Sep 2024 16:05:24 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/09/iamwil__httpss.mj.runy0J1SrXLT2k_An_anthropomorphic_fox_is_ho_17010c9a-5f0b-454f-9bc0-a72d9d03a579_2-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/09/iamwil__httpss.mj.runy0J1SrXLT2k_An_anthropomorphic_fox_is_ho_17010c9a-5f0b-454f-9bc0-a72d9d03a579_2-1.png" alt="Did we help you build system evals?"><p>Hello!</p><p>Sri and I want to thank you for purchasing the <a href="https://issue1.forestfriends.tech/?ref=forestfriends.tech" rel="noreferrer">first issue of Forest Friends: <em>LLM system evals in the wild</em></a>.</p><p>In this first week since launch, we&apos;ve hit our sales goal! And hopefully, you&apos;ve had a chance to look through the issue. </p><p>If you&apos;ve read through the issue, we&apos;d like to get some feedback from you.</p><ul><li>Did we help you build system evals for your LLM-driven app? </li><li>What topic would you want us to cover in issue 2?</li></ul><p>If you have any feedback on these and more, please send it to <a href="mailto:iamwil@gmail.com?subject=Forest%20Friends%20Feedback" rel="noreferrer">iamwil@gmail.com</a>.</p><p>Also, if you found the issue helpful, please tell your AI engineering friends about it! 
Thanks!</p>]]></content:encoded></item><item><title><![CDATA[First Issue of Forest Friends is Out!]]></title><description><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/cover_image-1.jpg" class="kg-image" alt loading="lazy" width="1232" height="928" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/cover_image-1.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/cover_image-1.jpg 1000w, https://forestfriends.tech/content/images/2024/08/cover_image-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>The first issue of Forest Friends on <em>Large Language Model System Evals in the Wild</em> is out! Those that have pre-ordered have received their copy [^1]. Those of you who have been waiting to see if it&apos;d come to fruition, the wait is over. It&apos;s now</p>]]></description><link>https://forestfriends.tech/first-issue-of-forest-friends-is-out/</link><guid isPermaLink="false">66c8aae357fc1200019ec2bf</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Thu, 29 Aug 2024 14:33:39 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/cover_image-1.jpg" class="kg-image" alt loading="lazy" width="1232" height="928" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/cover_image-1.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/cover_image-1.jpg 1000w, https://forestfriends.tech/content/images/2024/08/cover_image-1.jpg 1232w" sizes="(min-width: 720px) 720px"></figure><p>The first issue of Forest Friends on <em>Large Language Model System Evals in the Wild</em> is out! Those that have pre-ordered have received their copy [^1]. Those of you who have been waiting to see if it&apos;d come to fruition, the wait is over. 
It&apos;s now <a href="https://issue1.forestfriends.tech/?ref=forestfriends.tech" rel="noreferrer">available for sale</a>.</p><p>We originally targeted thirty pages but ended up expanding to sixty, covering how to get started on designing system evals, from vibes, through choosing quality dimensions, aligning LLMs as judges, and finally to analyzing the eval scores.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Here&apos;s an engaging intro to evals by <a href="https://twitter.com/sridatta?ref_src=twsrc%5Etfw&amp;ref=forestfriends.tech">@sridatta</a> and&#xA0;<a href="https://twitter.com/iamwil?ref_src=twsrc%5Etfw&amp;ref=forestfriends.tech">@iamwil</a>. They&apos;ve clearly put a lot of care and effort into it, where the content is well organized with plenty of illustrations throughout.<br><br>Across 60 pages, they explain model vs. system evals,  vibe checks and property-based&#x2026; <a href="https://t.co/8TnTvAKc0D?ref=forestfriends.tech">pic.twitter.com/8TnTvAKc0D</a></p>&#x2014; Eugene Yan (@eugeneyan) <a href="https://twitter.com/eugeneyan/status/1827863486162260084?ref_src=twsrc%5Etfw&amp;ref=forestfriends.tech">August 26, 2024</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>We go through some worked examples with Fox and Bear trying to run an LLM-driven cafe that creates recipes based on customer requests as we introduce the different concepts.</p><figure class="kg-card kg-gallery-card kg-width-wide"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://forestfriends.tech/content/images/2024/08/13.jpg" width="1125" height="1500" loading="lazy" alt srcset="https://forestfriends.tech/content/images/size/w600/2024/08/13.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/13.jpg 1000w, https://forestfriends.tech/content/images/2024/08/13.jpg 1125w" sizes="(min-width: 720px) 720px"></div><div class="kg-gallery-image"><img src="https://forestfriends.tech/content/images/2024/08/21.jpg" width="1125" height="1500" loading="lazy" alt srcset="https://forestfriends.tech/content/images/size/w600/2024/08/21.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/21.jpg 1000w, https://forestfriends.tech/content/images/2024/08/21.jpg 1125w" sizes="(min-width: 720px) 720px"></div><div class="kg-gallery-image"><img src="https://forestfriends.tech/content/images/2024/08/44-1.jpg" width="1125" height="1500" loading="lazy" alt srcset="https://forestfriends.tech/content/images/size/w600/2024/08/44-1.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/44-1.jpg 1000w, https://forestfriends.tech/content/images/2024/08/44-1.jpg 1125w" sizes="(min-width: 720px) 720px"></div></div></div></figure><p>We also cover how to use LLMs-as-a-judge, by first using a golden dataset to align the LLM, and iterating on the prompt to drive precision and recall metrics up. And if you&apos;re new to data science, we explain precision and recall too.</p><p>So if you want to get an intro to the eval side of building LLM-driven apps, this is a good place to start. 
If you&apos;re still unsure, <a href="https://forestfriends.tech/assets/preview.pdf?v=b1306bfd5b" rel="noreferrer">click here for the preview</a>, and feel free to ask me any questions <a href="https://x.com/iamwil?ref=forestfriends.tech" rel="noreferrer">@iamwil on Twitter</a>. </p><hr><p>[^1]: if you pre-ordered and had trouble with the download, get in touch at iamwil@gmail.com</p>]]></content:encoded></item><item><title><![CDATA[How to download Issue 1]]></title><link>https://forestfriends.tech/how-to-download-issue-1/</link><guid isPermaLink="false">66ca20c157fc1200019ec435</guid><category><![CDATA[Support]]></category><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Sat, 24 Aug 2024 18:17:04 GMT</pubDate><content:encoded/></item><item><title><![CDATA[Issue 1: Large Language Model System Evals in the Wild]]></title><link>https://forestfriends.tech/issue-001-system-evals/</link><guid isPermaLink="false">66c8bd1457fc1200019ec308</guid><category><![CDATA[Issues]]></category><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Sat, 24 Aug 2024 16:24:00 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/08/cover_image.jpg" medium="image"/><content:encoded/></item><item><title><![CDATA[Leveling up from vibes-based engineering]]></title><description><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png" class="kg-image" alt loading="lazy" width="672" height="448" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png 600w, https://forestfriends.tech/content/images/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png 672w"></figure><p>You&apos;ve got an LLM in prod and are relying 
on &quot;vibes&quot; to gauge output quality. Sound familiar? It&apos;s a start, but you&apos;ll need to instrument your pipeline and get systematic to drive real improvements.</p><p>Our brains are powerful pattern-matchers for what makes</p>]]></description><link>https://forestfriends.tech/leveling-up-from-vibes-based-engineering/</link><guid isPermaLink="false">669026dc584a8b000138601b</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 20 Aug 2024 20:02:48 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png" class="kg-image" alt loading="lazy" width="672" height="448" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png 600w, https://forestfriends.tech/content/images/2024/08/iamwil__httpss.mj.runXEPdXApjrkM_High_angle_view_looking_slight_7a56ff20-bcb1-4bb1-a254-6f84cf1f37b2-copy-1.png 672w"></figure><p>You&apos;ve got an LLM in prod and are relying on &quot;vibes&quot; to gauge output quality. Sound familiar? It&apos;s a start, but you&apos;ll need to instrument your pipeline and get systematic to drive real improvements.</p><p>Our brains are powerful pattern-matchers for what makes one response good in one context, but terrible in another. But that intuition can be hard to articulate. Starting with vibe-based engineering is how you begin articulating quality through an iterative process.</p><p>However, we have to systematize what makes a response good or bad, so we can compare responses between runs, between users, and over time. </p><p>First, instrument your pipeline and collect the outputs at every stage. For each user invocation, record the query, prompt, retrieved docs, chain of thought output, and final output. 
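</p><p>That instrumentation can be as simple as appending one record per invocation to a flat file. A minimal sketch (the field names and file layout are illustrative, not prescribed):</p>

```python
# Sketch: append one trace record per user invocation so every pipeline
# stage can be reviewed later. Field names here are illustrative.
import json
import time
import uuid

def log_trace(path, query, prompt, retrieved_docs, chain_of_thought, final_output):
    """Append a single JSON-lines trace record and return its id."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "prompt": prompt,
        "retrieved_docs": retrieved_docs,
        "chain_of_thought": chain_of_thought,
        "final_output": final_output,
    }
    with open(path, "a") as f:  # a flat file is enough to start; swap in a DB later
        f.write(json.dumps(record) + "\n")
    return record["id"]
```

A JSON-lines file keeps every stage of each invocation together, so later loss analysis can pinpoint where in the pipeline an error crept in.<p>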
You want to be able to review and audit the sequence of responses from the LLM should anything go wrong. You can dump them in a .txt or a database. Whatever&apos;s easier.</p><p>Then, curate your example dataset. Come up with examples that cover the main use cases and as many edge cases of your application as you can. Also, pull queries from user feedback and bug reports. Typical business metrics, like session length and bounce rate, will also be helpful.</p><p>Do loss analysis. Look through the data in the traced output: Did the error happen at the end or somewhere near the beginning? Group losses into categories and prioritize the largest category of losses. Tackle the largest category for fast improvement.</p><p>Lastly, iterate! It&apos;s surprising how fast you can make improvements with manual reviews, especially at first. It&#x2019;s only when we start tackling the long tail of the edge cases and comparison over time that we&#x2019;d want to leverage a systematized way of evaluating how our LLM is doing.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Intrigued about evals? We&apos;re publishing a zine on LLM evals. 
<a href="https://forestfriends.tech/" rel="noreferrer">Click here to preview and pre-order</a>.</div></div>]]></content:encoded></item><item><title><![CDATA['LLM Evals In The Wild' Zine - The Final Push!]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://forestfriends.tech/content/images/2024/08/Screenshot-2024-08-16-at-2.04.45-PM.png" class="kg-image" alt loading="lazy" width="2000" height="1285" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/Screenshot-2024-08-16-at-2.04.45-PM.png 600w, https://forestfriends.tech/content/images/size/w1000/2024/08/Screenshot-2024-08-16-at-2.04.45-PM.png 1000w, https://forestfriends.tech/content/images/size/w1600/2024/08/Screenshot-2024-08-16-at-2.04.45-PM.png 1600w, https://forestfriends.tech/content/images/2024/08/Screenshot-2024-08-16-at-2.04.45-PM.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Our zine layout on Canva. We&apos;re almost at 60 pages!</span></figcaption></figure><p><em>As a reminder, you subscribed to get updates on our zine about LLM evals.</em></p><p>We&apos;ve had some folks eager to get their hands on their copy and are wondering where it is. 
</p><p>Rest assured, we&apos;re putting on the final touches. As with many creative endeavors, it took longer than we expected. But you&apos;ll be getting some benefits. </p><ul><li><strong>The zine is now 60 pages vs. 30 </strong>(our original estimate) &#x2013; We realized there is so much tacit knowledge that was previously only known to ML researchers and data scientists. So we added a bunch of material spelling out exactly what to do to get an eval up and running.</li><li>It features a <strong>new solarpunk art style</strong> &#x2013; We reworked the art to be more futuristic. 
It&apos;s all Midjourney-generated now, which will let us work on future zine issues even quicker.</li></ul><p>We&apos;ve been calling it a zine but really it&apos;s shaping up to be a full-fledged book.</p><p>If you&apos;ve already ordered, you&apos;ll be getting your copy by email as soon as next week. If you haven&apos;t ordered yet, go <a href="https://forestfriends.tech/" rel="noreferrer">order it here</a> while it&apos;s still at the pre-order price!</p>]]></content:encoded></item><item><title><![CDATA[Doing vibes-based engineering right]]></title><description><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png" class="kg-image" alt loading="lazy" width="616" height="464" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png 600w, https://forestfriends.tech/content/images/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png 616w"></figure><p>If you&apos;re struggling to whip your LLM output into shape, start with &quot;vibe-based engineering&quot;, by judging outputs by look &amp; feel. It doesn&apos;t scale, but it&apos;s the foundation for systematizing evals later. </p><p>What is vibe-based engineering? 
Just start by looking at the</p>]]></description><link>https://forestfriends.tech/doing-vibes-based-engineering-right/</link><guid isPermaLink="false">66902879584a8b0001386040</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 13 Aug 2024 20:01:55 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png" class="kg-image" alt loading="lazy" width="616" height="464" srcset="https://forestfriends.tech/content/images/size/w600/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png 600w, https://forestfriends.tech/content/images/2024/08/iamwil__Close-up_view_of_an_anthropomorphic_pair_of_kawaii_be_632bd013-b8c8-4f51-a415-44811c3363ab_2-2.png 616w"></figure><p>If you&apos;re struggling to whip your LLM output into shape, start with &quot;vibe-based engineering&quot;, by judging outputs by look &amp; feel. It doesn&apos;t scale, but it&apos;s the foundation for systematizing evals later. </p><p>What is vibe-based engineering? Just start by looking at the generated results. Do they look alright? It&#x2019;s ok to start here and do things that don&#x2019;t scale. Good can be hard to articulate, and by looking at output one by one, you can start to define what constitutes a good output.</p><p>Be willing to iterate. Sci-fi promised AI as an intellectual supreme being, but by default the reality is more like a writer of bad high school essays. You must be willing to keep changing the prompt to see what coaxes more focused and inspired responses. It&apos;s helpful to think of LLMs as a throng of humanity, and it&apos;s up to you to steer it toward the geniuses among us with your prompt.</p><p>Use an exacting communication style. LLMs are like fast-working interns who need a lot of guidance. 
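</p><p>Some of that guidance can even be checked mechanically before any judging happens. A sketch of cheap deterministic checks (the length limit and required JSON keys here are hypothetical):</p>

```python
# Sketch: deterministic checks that don't need an LLM judge at all.
# The length limit and required keys below are hypothetical.
import json

MAX_CHARS = 2000  # hypothetical output budget

def check_output(text: str) -> list[str]:
    """Return names of failed checks; an empty list means all checks pass."""
    failures = []
    if len(text) > MAX_CHARS:
        failures.append("too_long")
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        failures.append("not_json")
        return failures
    if not isinstance(payload, dict) or not {"title", "steps"} <= payload.keys():
        failures.append("missing_keys")
    return failures

print(check_output('{"title": "Stew", "steps": ["simmer"]}'))  # prints []
```

Gating on checks like these keeps the expensive, subjective judging focused on outputs that are at least structurally sound.<p>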
By starting out judging each output manually, you&#x2019;ll start to be able to articulate why an output is good. If you&apos;re still struggling, just find examples to show the LLM.</p><p>Build a golden dataset. Based on the manual work that you&#x2019;re doing, you&#x2019;re building a database of samples. Shoot for 30 to 50 samples to help steer future automated evals. Examples should cover the span of outputs, as well as some edge cases.</p><p>Lastly, slowly systematize and automate the eval process. Use golden datasets, unit tests, and statistical metrics to see how well your eval process is doing. Unit tests catch easy things you don&#x2019;t need an LLM for, such as output length or JSON formatting. Statistical metrics like precision, recall, and F1 scores triangulate how well your eval reproduces the judgments in your golden dataset.</p><p>While vibes engineering doesn&apos;t scale, it&apos;s a good way to build intuition for your domain and the types of output that you&apos;re getting from which kinds of prompts. Before you can teach quality to another human or to an LLM, you have to know what good looks like yourself first. Vibes is definitely a way to do that, and making sure you&apos;re systematic and organized about it is a good way to slowly automate it.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Intrigued about evals? We&apos;re publishing a zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Click here to preview and pre-order</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Iterate, iterate, iterate your prompts]]></title><description><![CDATA[<p>Sci-fi has set our expectations for LLMs incorrectly. People write a one or two-line prompt and give up when they don&#x2019;t get the expected output. To effectively use LLMs today, you need to adopt a particular habit of mind. 
Just like writing in school, the only way you</p>]]></description><link>https://forestfriends.tech/iterate-iterate-iterate-your-prompts/</link><guid isPermaLink="false">66902785584a8b000138602b</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 06 Aug 2024 19:59:24 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/07/IMG_2603.HEIC.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/07/IMG_2603.HEIC.jpg" alt="Iterate, iterate, iterate your prompts"><p>Sci-fi has set our expectations for LLMs incorrectly. People write a one or two-line prompt and give up when they don&#x2019;t get the expected output. To effectively use LLMs today, you need to adopt a particular habit of mind. Just like writing in school, the only way you get good writing is to:<br><br>Iterate, iterate, iterate.<br><br>Prompting is much easier for people who understand that communication is hard, and the only way to arrive at a mutual understanding is in incremental steps. Start by spelling out the task you&#x2019;d like the LLM to perform. Then depending on the output, start layering on different conditions, such as the role it&apos;s supposed to take. An insightful design architect. The most moving copywriter. Or the most empathetic customer support. You can add examples of what you&#x2019;re looking for based on the output. Is the output too sale-sy? Tell the LLM to tone it down. Not creative enough? Tell it to go wild on creativity. <br><br>You can dictate style and what good looks like. If you can&#x2019;t describe it, find examples. Look at the output again. Not quite right? Ask it to follow specific principles you know that make good output if you were to describe it to a junior colleague. <br><br>Lay out guiding principles. Ask it to reason out the task step by step (chain of thought) before tackling the task. Don&#x2019;t be overly specific. 
It&#x2019;s easy to get lackluster output when you over-specify what you want. Finding the right balance also requires experimentation and iteration over time.<br><br>There are many other ways to improve the output. The key is to iterate, iterate, iterate.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F372;</div><div class="kg-callout-text">Like what you read? We&apos;re serving it hot with a digital zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Visit Forest Friends to see all the details</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Grading scales matter for eval consistency]]></title><description><![CDATA[<p>Some thought should be paid to the grading scale when systematizing your LLM evals, regardless of whether they&apos;re for humans or for LLM-as-a-judge. A bad grading scale can skew your ability to assess your output accurately. <br><br>Grading scales are the ratings by which humans or LLMs judge the output</p>]]></description><link>https://forestfriends.tech/grading-scales-matter-for-eval-consistency/</link><guid isPermaLink="false">669031e0584a8b0001386085</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 30 Jul 2024 19:57:40 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/07/IMG_2601-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/07/IMG_2601-1.jpg" alt="Grading scales matter for eval consistency"><p>Some thought should be paid to the grading scale when systematizing your LLM evals, regardless of whether they&apos;re for humans or for LLM-as-a-judge. A bad grading scale can skew your ability to assess your output accurately. <br><br>Grading scales are the ratings by which humans or LLMs judge the output of another LLM doing a task. You can ask human/LLM judges to rate the output on a numeric scale like Yelp reviews. <br><br>Typical scales are 1 to 5 or 1 to 10. 
The trouble with numeric scales is that they&apos;re inconsistent. What exactly constitutes a 5 on a 10-point scale? And exactly how is it different from a 6? Even if you could spell it out, a typical human won&apos;t be able to keep it all in their head.<br><br>In addition, on a 10-point scale, people won&apos;t use the full range evenly. Most humans will pick 7 to mean &quot;meh.&quot; 1-6 is considered &quot;failing&quot;, and if something is bad, they&apos;ll pounce on 1 for emotional emphasis, thus neglecting 2-6. On the other end, 8-10 is considered &quot;safely optimistic&quot;, so ratings compress to just three positions on a 10-point scale.<br><br>As it turns out, LLMs are also bad at this. As a result, your evals will be inconsistent, with little discriminating power. Both humans and LLMs do best with a binary judgment (yes/no, true/false, relevant/not relevant), or a side-by-side comparison between two competing outputs.<br><br>Side-by-side comparisons can generate Elo scores over time for a ranking of outputs. The more nuanced the grading scale, the more likely two graders (or the same grader on different days!) would give different answers. <br><br>It&apos;s counterintuitive, but you get nuance from sampling across many independent judgments, rather than from resolution in the grading scale!</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F372;</div><div class="kg-callout-text">Like what you read? We&apos;re serving it hot with a digital zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Visit Forest Friends to see all the details</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Forest Friends Eval Zine: Weekly Update]]></title><description><![CDATA[<p>Hey Forest Friend!</p><p>Wil and I are hard at work on the final pass of the zine. 
Right now we&#x2019;re focused on ensuring that it is <strong>actionable</strong>; that every page gets you one step closer to trusting in your LLM system.</p><figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png" class="kg-image" alt loading="lazy" width="694" height="1116" srcset="https://forestfriends.tech/content/images/size/w600/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png 600w, https://forestfriends.tech/content/images/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png 694w"></figure><p>Last month, I went to the</p>]]></description><link>https://forestfriends.tech/weekly-update-0725/</link><guid isPermaLink="false">66a17168584a8b00013861ea</guid><dc:creator><![CDATA[Sridatta Thatipamala]]></dc:creator><pubDate>Wed, 24 Jul 2024 21:31:33 GMT</pubDate><content:encoded><![CDATA[<p>Hey Forest Friend!</p><p>Wil and I are hard at work on the final pass of the zine. Right now we&#x2019;re focused on ensuring that it is <strong>actionable</strong>; that every page gets you one step closer to trusting in your LLM system.</p><figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png" class="kg-image" alt loading="lazy" width="694" height="1116" srcset="https://forestfriends.tech/content/images/size/w600/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png 600w, https://forestfriends.tech/content/images/2024/07/Screenshot-2024-07-25-at-6.29.02-AM.png 694w"></figure><p>Last month, I went to the AI Engineer&#x2019;s World Fair in San Francisco and talked to 100+ people about their eval strategy. One of the fears that I heard was: &#x201C;evals are a big company thing and not for me.&#x201D;</p><p>I can see why people would think that. There&#x2019;s no shortage of content describing why you should have evals. 
There&#x2019;s very little content telling you <strong>exactly what you should build and how</strong>.</p><p>Our goal with this zine is to help you go from &#x201C;zero to eval&#x201D; and show value to your team.</p><p>If you&#x2019;ve already pre-ordered, thank you for your support! And if you haven&#x2019;t, go ahead and <a href="https://forestfriends.tech/" rel="noreferrer">do it now</a>!</p>]]></content:encoded></item><item><title><![CDATA[Leverage code for system evals]]></title><description><![CDATA[<p>There&apos;s no one tool that will nail down a notion of &quot;good&quot; for you to express in an LLM eval. Instead, you need to use multiple tools to triangulate this notion of &quot;good&quot; to evaluate an output.<br><br>A simple, yet overlooked eval is code.</p>]]></description><link>https://forestfriends.tech/leverage-code-for-evals/</link><guid isPermaLink="false">6690323d584a8b000138608e</guid><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 23 Jul 2024 19:55:44 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/07/IMG_2600.HEIC.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/07/IMG_2600.HEIC.jpg" alt="Leverage code for system evals"><p>There&apos;s no one tool that will nail down a notion of &quot;good&quot; for you to express in an LLM eval. Instead, you need to use multiple tools to triangulate this notion of &quot;good&quot; to evaluate an output.<br><br>A simple, yet overlooked eval is code. No need to throw away tools we already know. </p><p>This goes by different names: property-based testing or unit tests. They&apos;re useful when what you want to check is very precisely defined and easy to express in code. No need to leverage humans or LLMs to do this kind of checking, as they can&apos;t do better than code. For example, simply checking JSON output formatting or conformity to an API specification can be done in code. 
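<br><br>Such checks can be sketched in a few lines of Python. This is a minimal illustration: the required keys (&quot;answer&quot;, &quot;sources&quot;) and the 420-character limit are hypothetical stand-ins for whatever your actual spec requires:

```python
import json
import re

# Hypothetical spec: required keys and max length are illustrative only.
REQUIRED_KEYS = ("answer", "sources")

def check_json_output(text: str, max_chars: int = 420) -> dict:
    """Cheap, deterministic code-based checks on a raw LLM response."""
    checks = {"within_length": len(text) < max_chars}
    try:
        payload = json.loads(text)
        checks["valid_json"] = True
        checks["has_required_keys"] = all(k in payload for k in REQUIRED_KEYS)
    except (json.JSONDecodeError, TypeError):
        checks["valid_json"] = False
        checks["has_required_keys"] = False
    return checks

def looks_like_email(s: str) -> bool:
    # "good enough" check: something@something.tld with no spaces,
    # rather than a full RFC-compliant email regex
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s) is not None
```

Each check is cheap, deterministic, and easy to run over thousands of outputs.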
<br><br>So can checks for certain size restrictions, such as being under 420 characters, or for fixed templates or elements the output must contain. You can do all of that with regexes and counts. But don&apos;t go too far with regexes. Think about where the 80% line is. For something like detecting whether there&apos;s a full name or just a last name, lean towards an LLM. There are far too many variations in the wild for a regex.<br><br>Did you account for deBussey? van Buren? Stenson-clifford? For something like &quot;is it an email?&quot;, you can get away with a regex on &quot;@&quot; most of the time. While the full email spec results in a ridiculous regex to catch all edge cases, those edge cases occur far less often than odd last names.<br><br>But you have to judge, for your own case, what an acceptable false negative rate is. Either way, leverage code for what it&apos;s good at: quick checks for properties of the output where you want high precision and that can be easily expressed in code.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F372;</div><div class="kg-callout-text">Like what you read? We&apos;re serving it hot with a digital zine on LLM evals. <a href="https://forestfriends.tech/" rel="noreferrer">Click here to see all the details</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Defining good metrics for evaluation]]></title><description><![CDATA[<p>What constitutes good? LLMs aren&#x2019;t magic. If you can&#x2019;t articulate what good looks like, the LLM won&#x2019;t know either. Part of that is learning to pick the metrics to bracket in that elusive definition of good. 
All that you learned in ML engineering doesn&</p>]]></description><link>https://forestfriends.tech/defining-good-metrics-for-evaluation-2/</link><guid isPermaLink="false">66903020584a8b0001386073</guid><category><![CDATA[System Evals]]></category><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 16 Jul 2024 16:57:15 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/07/IMG_2599.HEIC.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/07/IMG_2599.HEIC.jpg" alt="Defining good metrics for evaluation"><p>What constitutes good? LLMs aren&#x2019;t magic. If you can&#x2019;t articulate what good looks like, the LLM won&#x2019;t know either. Part of that is learning to pick the metrics to bracket in that elusive definition of good. All that you learned in ML engineering doesn&#x2019;t go away with prompt engineering.</p><p>When picking metrics for your eval, keep in mind there is no one good measure. You&#x2019;ll need to use multiple. There are two types of metrics: optimizing metrics and satisficing metrics.</p><p>Optimizing metrics are the numbers you want to keep improving as a measure of how well you&#x2019;re hitting your goal. This can depend on the domain, but keep the metric simple. It won&#x2019;t capture every aspect of good, but all metrics are a proxy for good.</p><p>Satisficing metrics are guardrails to ensure your product avoids risk to your users or your business. Typically there are responses considered universally inappropriate in your domain, such as medical advice or life advice.</p><p>These two numbers are in tension, often playing tug-of-war at the intersection of helpfulness. 
For example, to improve a customer support AI, it may be more helpful to be more empathetic, but it may also decrease safety by dispensing medical advice.</p><p>Then for each of the two types of metrics, you can pay attention to two attributes for the anatomy of that metric: quality and grading scale.</p><p>What is quality? It&#x2019;s not enough to say you&#x2019;ll know it when you see it. It&#x2019;s also not enough to give general advice about it for every domain. For example, in a Q&amp;A app, quality might be measured by classic information retrieval metrics like relevance, precision, and recall.</p><p>Whereas with conversational assistant apps, quality might be measured by the tool selection accuracy and the end-to-end success rate. It all depends on the application. Again, remember you&#x2019;ll need multiple metrics that are simple to calculate to box in a definition of good.</p><p>For the grading scale of a metric, you want to ensure you get consistent judgments from the same output. Both people and LLMs are much more consistent with binary choices and choosing the better of two versions.</p><p>Remember that metrics are proxies for quality. You&apos;ll always need to use multiple metrics to triangulate the qualities that you want in an output. And finally, you&apos;ll need to keep an eye on the actual output over time to ensure that the metrics aren&apos;t being gamed. </p><p>Each time you&apos;re doing a vibe-based eval, keep building an intuition of what good looks like and write down that intuition. That&apos;ll make it easier to find metrics that will align with the intuition later down the line!</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F372;</div><div class="kg-callout-text">Like what you read? We&apos;re serving it hot with a digital zine on LLM evals. 
<a href="https://forestfriends.tech/" rel="noreferrer">Click here to see all the details</a>.</div></div>]]></content:encoded></item><item><title><![CDATA[Pre-order available for LLM eval Issue 1]]></title><description><![CDATA[Today, you can pre-order the first issue detailing LLM evals. For a limited time, pre-orders are $20; the after-release price is $24. ]]></description><link>https://forestfriends.tech/pre-order-available-for-llm-eval-issue-1/</link><guid isPermaLink="false">668d90b4584a8b0001385f2f</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Wil Chung]]></dc:creator><pubDate>Tue, 09 Jul 2024 20:55:09 GMT</pubDate><media:content url="https://forestfriends.tech/content/images/2024/07/issue_1_cover_header-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://forestfriends.tech/content/images/2024/07/issue_1_cover_header-1.jpg" alt="Pre-order available for LLM eval Issue 1"><p>Hi all, thanks for your interest in the LLM eval digital zine, whether through Twitter or from meeting Sri, the guy with the &quot;Ask me about Evals&quot; t-shirt at the AI conference. </p><p>Today, you can pre-order the first issue detailing LLM evals at <a href="https://forestfriends.tech/" rel="noreferrer">forestfriends.tech</a>. For a limited time, pre-orders are $20, down from the after-release price of $24. You&apos;ll learn how to build evals, including (but not limited to):</p><ul><li>How to do vibes-based engineering right.</li><li>How different grading scales affect evals.</li><li>Witness cosmic horrors explain statistical metrics.</li><li>Bright hand-drawn illustrations eschewing AI-generated images. 
(or is it?)</li></ul><figure class="kg-card kg-image-card"><img src="https://forestfriends.tech/content/images/2024/07/IMG_2309.jpg" class="kg-image" alt="Pre-order available for LLM eval Issue 1" loading="lazy" width="2000" height="1125" srcset="https://forestfriends.tech/content/images/size/w600/2024/07/IMG_2309.jpg 600w, https://forestfriends.tech/content/images/size/w1000/2024/07/IMG_2309.jpg 1000w, https://forestfriends.tech/content/images/size/w1600/2024/07/IMG_2309.jpg 1600w, https://forestfriends.tech/content/images/2024/07/IMG_2309.jpg 2016w" sizes="(min-width: 720px) 720px"></figure><p>We&apos;re still hard at work producing the assets and organizing the flow of the content. We should be ready to deliver at the beginning of August and will keep you all posted. </p>]]></content:encoded></item></channel></rss>