<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jamesb</title>
    <description>The latest articles on DEV Community by Jamesb (@experilearning).</description>
    <link>https://dev.to/experilearning</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1226520%2Fb84f0d34-58e5-4b0a-8e74-38e85345626a.jpg</url>
      <title>DEV Community: Jamesb</title>
      <link>https://dev.to/experilearning</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/experilearning"/>
    <language>en</language>
    <item>
      <title>Unlocking Advanced RAG: Citations and Attributions</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Mon, 29 Jan 2024 17:22:41 +0000</pubDate>
      <link>https://dev.to/experilearning/unlocking-advanced-rag-citations-and-attributions-59lk</link>
      <guid>https://dev.to/experilearning/unlocking-advanced-rag-citations-and-attributions-59lk</guid>
      <description>&lt;p&gt;Often we want LLMs to cite exact quotes from source material we provide in our prompts. This is useful in academic contexts to cite snippets from papers, for law firms who need to cite sections of the legal code for a case and in business applications where knowing the exact source of a quote can save hours of scrolling through financial documents. But it's not as simple as asking the LLM to return exact quotes in the prompt. We can't trust that the LLM won't just hallucinate a quote or citation that doesn't really exist. So how can we get LLMs to provide exact quotes or citations and verify that they are correct?&lt;/p&gt;

&lt;p&gt;I ran into this problem while working on the &lt;a href="https://bjsi.github.io/remnote-flashcard-copilot.html" rel="noopener noreferrer"&gt;RemNote Flashcard Copilot&lt;/a&gt;. My goal was to allow new users to generate flashcards by highlighting a paragraph from their notes. I wanted to add citation pins linking the AI generated flashcards to each sentence in their notes so that new users could understand how the AI had used their notes to generate the flashcards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldeyqfpv0l44jhfl557x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldeyqfpv0l44jhfl557x.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Naive substring checks don't work because LLMs will make small changes in punctuation and wording, and will quietly correct spelling and grammar mistakes.&lt;/p&gt;
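&lt;p&gt;To make this concrete, here's a tiny illustration (the strings are made up for the example):&lt;/p&gt;

```typescript
// Why a naive substring check fails: the model "fixed" the punctuation,
// so an exact includes() lookup misses a quote that is genuinely there.
const source = "LLMs are powerful, but they hallucinate";
const cited = "LLMs are powerful but they hallucinate."; // comma dropped, period added

const naiveMatch = source.includes(cited); // false, even though the quote is real
```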

&lt;p&gt;Initially I attempted to split by sentence and generate flashcards per sentence. But this doesn't generalise across languages - different languages use different sentence delimiters, and end-of-sentence delimiters can have different meanings in different contexts (a period may end a sentence or an abbreviation). Of course I could ask an LLM to chunk a paragraph into sentences, but this would add a bunch of additional waiting time for the user.&lt;/p&gt;

&lt;p&gt;Then I tried using an LLM to verify that citations are correct. The issue is that this approach is still error-prone - you still run into situations where the checker LLM says the cited sentence is correct and present, but when you go to search for it, you can't find it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Solution
&lt;/h1&gt;

&lt;p&gt;What we need is a way to verify with a high probability that the sentence or paragraph cited by the LLM is a genuine citation. And beyond that, it would be ideal if we could find the original citation within the source text ourselves rather than trusting the LLM citation. This is useful when we want to add UI elements to text we are rendering in our application.&lt;/p&gt;

&lt;p&gt;The solution I came up with can be summarised as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM cites a sentence&lt;/li&gt;
&lt;li&gt;I fuzzy search to find the best match to that sentence in the original text&lt;/li&gt;
&lt;li&gt;If the match ratio is close (&amp;gt;90%) it's considered valid&lt;/li&gt;
&lt;/ul&gt;
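&lt;p&gt;The three steps above can be sketched end to end. The following is a dependency-free stand-in for illustration only - the actual implementation uses fuzzball's &lt;code&gt;ratio&lt;/code&gt; and difflib's &lt;code&gt;SequenceMatcher&lt;/code&gt;, while this sketch slides a plain Levenshtein-based ratio over the source text:&lt;/p&gt;

```typescript
// Levenshtein edit distance between two strings.
function editDistance(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; a.length >= i; i++) {
    const row: number[] = new Array(b.length + 1).fill(0);
    row[0] = i;
    dp.push(row);
  }
  for (let j = 0; b.length >= j; j++) dp[0][j] = j;
  for (let i = 1; a.length >= i; i++) {
    for (let j = 1; b.length >= j; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Similarity on a 0-100 scale, like fuzzball's ratio.
function simRatio(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 100;
  return (1 - editDistance(a, b) / longest) * 100;
}

// Slide a citation-sized window over the source text and keep the
// best-scoring window, so we can locate the citation ourselves.
function verifyCitation(citation: string, source: string, threshold = 90) {
  let bestScore = 0;
  let bestStartIdx = -1;
  for (let i = 0; source.length >= i + citation.length; i++) {
    const score = simRatio(citation, source.substring(i, i + citation.length));
    if (score > bestScore) {
      bestScore = score;
      bestStartIdx = i;
    }
    if (bestScore > 99.5) break; // good enough, stop early
  }
  return { valid: bestScore > threshold, bestScore, bestStartIdx };
}
```

&lt;p&gt;A call like &lt;code&gt;verifyCitation(citedSentence, sourceText)&lt;/code&gt; scoring above 90 counts as a genuine citation, and &lt;code&gt;bestStartIdx&lt;/code&gt; tells you where in the source to attach UI elements.&lt;/p&gt;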

&lt;p&gt;Fuzzball's partial ratio is useful when you want to measure the similarity between two strings while focusing only on the best-matching substring. Searching for 'pie' inside 'apple pie' yields a score of 100 because the shorter string 'pie' appears verbatim within the longer string 'apple pie'.&lt;/p&gt;

&lt;p&gt;I modified the original &lt;code&gt;partial_ratio&lt;/code&gt; function to return the best scoring match and its start index, because this weirdly wasn't included in the &lt;code&gt;partial_ratio&lt;/code&gt; function from fuzzball.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fuzzball&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SequenceMatcher&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;difflib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// modified from: https://github.com/nol13/fuzzball.js/blob/773b82991f2bcacc950b413615802aa953193423/fuzzball.js#L942&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;partial_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;shorter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;str1&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;str2&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;shorter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;str2&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;str1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SequenceMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;shorter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMatchingBlocks&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestMatch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestStartIdx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;long_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;long_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;long_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;shorter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;long_substr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;longer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;long_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;long_end&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shorter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;long_substr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;bestScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="nx"&gt;bestMatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;long_substr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="nx"&gt;bestStartIdx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;long_start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;99.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bestMatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bestScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bestStartIdx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I hope this can be useful to someone!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>openai</category>
    </item>
    <item>
      <title>Fine Tuning LLMs to Process Massive Amounts of Data 50x Cheaper than GPT-4</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Mon, 08 Jan 2024 18:40:14 +0000</pubDate>
      <link>https://dev.to/experilearning/fine-tuning-llms-to-process-massive-amounts-of-data-50x-cheaper-than-gpt-4-4a1d</link>
      <guid>https://dev.to/experilearning/fine-tuning-llms-to-process-massive-amounts-of-data-50x-cheaper-than-gpt-4-4a1d</guid>
      <description>&lt;p&gt;In this article I'll share how I used &lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt; to effortlessly fine tune Mistral 7B, reducing the cost of one of my prompts by 50x. I included tips and recommendations if you are doing this for the first time, because I definitely left some performance increases on the table. Skip to &lt;a href="https://dev.tofine-tuning-open-recommender"&gt;Fine Tuning Open Recommender&lt;/a&gt; if you are specifically interested in what the fine tuning process looks like. You can always DM me on Twitter (&lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;) or leave a comment if you have questions!&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Over the past month I have been working on &lt;a href="https://github.com/bjsi/open-recommender"&gt;Open Recommender&lt;/a&gt;, an open source YouTube video recommendation system which takes your Twitter feed as input and recommends you relevant YouTube-shorts style video clips.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KbBwhuVpqC0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I've successfully iterated on the prompts, raising the interestingness and relevancy of the clip recommendations from 50% to over 80%. See &lt;a href="https://dev.to/experilearning/building-an-open-source-llm-recommender-system-prompt-iteration-and-refinement-7b4"&gt;this article&lt;/a&gt; where I share the prompt engineering techniques I used. I also implemented a &lt;a href="https://open-recommender-ui.onrender.com/"&gt;basic web UI you can try out&lt;/a&gt;. Open Recommender even has 3 paying users (not my Mum, Dad and sister lol 😃).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a1c6QW4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s8s6yr9l5donpz0q5m6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a1c6QW4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s8s6yr9l5donpz0q5m6.png" alt="revenue" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we are one step closer to scaling to millions of users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jn399ZGw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/izdkp85t6wcrrz1z8k06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jn399ZGw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/izdkp85t6wcrrz1z8k06.png" alt="todos" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But let's not get ahead of ourselves. There are a couple of problems I need to solve before I can even consider scaling things up. Each run is currently very expensive, costing at least &lt;strong&gt;$10-15 per run&lt;/strong&gt; 😲, with the bulk of the cost coming from pouring vast numbers of tokens from video transcripts into GPT-4. I expect the cost will get even worse when we add more input sources like blogs and articles!&lt;/p&gt;

&lt;p&gt;It also takes 5-10 minutes to run the entire pipeline for a single user 🫣. There are still some things unrelated to model inference that can be improved here, so I'm not &lt;em&gt;massively&lt;/em&gt; concerned, but it would be great if we could speed things up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine Tuning
&lt;/h2&gt;

&lt;p&gt;Fine tuning provides a solution to both the cost and speed issues I'm experiencing. The most powerful LLMs like GPT-4 are trained on a vast array of internet text to understand language generally and perform a wide array of tasks at a high level. But most applications outside of highly general AI agents only require the LLM to be good at a specific task, like summarising documents, filtering search results or reranking recommendations. Instead of a generalist Swiss Army knife LLM, we'd rather have a highly specialised natural language processing function.&lt;/p&gt;

&lt;p&gt;Fine-tuning involves taking an LLM with some general understanding of language and then training it further on a specific type of task. It turns out that if we only require proficiency at one specific language-based task, then we can get away with using a much smaller model. For instance, we could fine tune a 7 billion parameter model like Llama or Mistral and achieve performance that is just as good or better than GPT-4 on our specific task, but 10x cheaper and faster.&lt;/p&gt;

&lt;p&gt;Now traditional fine tuning is a massive pain, because it requires a bunch of setup, manual dataset curation, buying or renting GPUs, and potential frustration if you set something up incorrectly and your fine tuning run blows up in the middle. Then once you have your fine tuned model, you have to figure out how to host it and manage the inference server, scaling, LLM ops etc... That's where &lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt; comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPipe
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt; is an &lt;a href="https://github.com/OpenPipe/OpenPipe"&gt;open source&lt;/a&gt; project that specialises in helping companies incrementally replace expensive OpenAI GPT-4 prompts with faster, cheaper fine-tuned models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9Q-6KN-8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rnq16r7tr9euv99sdjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9Q-6KN-8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9rnq16r7tr9euv99sdjy.png" alt="open pipe" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They have a simple drop-in replacement for OpenAI's library which records your requests to a web interface, helping you curate a dataset from your GPT-4 calls and use it to fine-tune a smaller, faster, cheaper open source model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v7c4v46r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vtpuq21xuynzm8qxrtng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v7c4v46r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vtpuq21xuynzm8qxrtng.png" alt="how openpipe works" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Effectively OpenPipe helps you use GPT-4 to "teach" a smaller model how to perform at a high level on your specific task through fine tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenAI Wrapper Playbook
&lt;/h2&gt;

&lt;p&gt;You first need a dataset of examples to fine tune on, because the out-of-the-box performance of these smaller models can't compete with GPT-4. So we start out with GPT-4 to collect synthetic training data, then fine tune an open source model once we have enough examples.&lt;/p&gt;

&lt;p&gt;So at a high level the playbook for LLM startups looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the most powerful model available and iterate on your prompts until you get something working reasonably well.&lt;/li&gt;
&lt;li&gt;Record each request, building up a dataset of input and output pairs created with GPT-4.&lt;/li&gt;
&lt;li&gt;Use the dataset to train a smaller, faster, cheaper open source model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✨ When exactly you decide to fine tune will vary depending on your use case. If you use LLMs for heavy amounts of text processing, you might need to do it before you scale, to avoid huge OpenAI bills. But if you only make a moderate number of calls, you can focus more on prompt iteration and finding product market fit before fine tuning. It all depends on your revenue, model usage and costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine Tuning Open Recommender
&lt;/h2&gt;

&lt;p&gt;Here's what the process looked like in practice for Open Recommender.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup OpenPipe
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.openpipe.ai/getting-started/quick-start"&gt;Sign up for an OpenPipe API key&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENPIPE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;your key&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then replace the OpenAI library with OpenPipe's wrapper.&lt;/p&gt;

&lt;p&gt;Replace this line&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;with this one&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;openpipe&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✨ Turn on prompt logging. It's enabled by default, but I disabled it during the prompt iteration phase, thinking that it would bloat the request logs with a bunch of data that I'd need to filter out later. In hindsight, it's better to keep it turned on from the start and use OpenPipe's &lt;a href="https://docs.openpipe.ai/getting-started/openpipe-sdk"&gt;request tagging feature&lt;/a&gt; to save the &lt;code&gt;prompt_id&lt;/code&gt; on each request log. Later, when you want to create a dataset for fine tuning, you can easily filter your request logs down to a particular version of your prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Dataset
&lt;/h3&gt;

&lt;p&gt;The next step is to run your prompts a bunch to create some data!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SFuMBc-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqe5pv1zd5ay52k272on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SFuMBc-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqe5pv1zd5ay52k272on.png" alt="sending recs to users 1" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CtKa7T0b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h920ugfgjo36lwlgw0ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CtKa7T0b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h920ugfgjo36lwlgw0ew.png" alt="sending recs to users 2" width="800" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running my prompts a bunch on my initial set of users, I had a moderate sized dataset. I decided to do a quick and dirty fine tune to see what kind of performance I could get out of Mistral 7B without any dataset filtering or augmentation.&lt;/p&gt;

&lt;p&gt;✨ I exported all of my request logs and did some quick analysis with a script to figure out which prompts in my pipeline account for the bulk of the latency and cost. It makes sense to prioritise fine tuning the costliest and slowest prompts. For Open Recommender, the "Recommend Clips" prompt, which is responsible for cutting long YouTube transcripts into clips, is very slow and costly, so I started there.&lt;/p&gt;
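&lt;p&gt;My analysis script was just a quick aggregation along these lines (the log fields here are assumptions for illustration, not OpenPipe's actual export schema):&lt;/p&gt;

```typescript
// Hypothetical shape of an exported request log entry. The field names
// are assumptions for this example, not OpenPipe's real export schema.
interface RequestLog {
  promptId: string;
  latencyMs: number;
  costUsd: number;
}

interface PromptTotals {
  promptId: string;
  costUsd: number;
  latencyMs: number;
  calls: number;
}

// Total up cost and latency per prompt, most expensive first, so the
// costliest prompts can be prioritised for fine tuning.
function rankPromptsByCost(logs: RequestLog[]): PromptTotals[] {
  const totals: { [promptId: string]: PromptTotals } = {};
  for (const log of logs) {
    if (!totals[log.promptId]) {
      totals[log.promptId] = { promptId: log.promptId, costUsd: 0, latencyMs: 0, calls: 0 };
    }
    const t = totals[log.promptId];
    t.costUsd += log.costUsd;
    t.latencyMs += log.latencyMs;
    t.calls += 1;
  }
  return Object.values(totals).sort((a, b) => b.costUsd - a.costUsd);
}
```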

&lt;p&gt;In the OpenPipe request log UI you can add lots of different filters to create your dataset. I just did a simple filter for all requests with a particular prompt name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TnDdAMCT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/be36qwv7lty8dvwkwuxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TnDdAMCT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/be36qwv7lty8dvwkwuxc.png" alt="request log filters" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you can create the dataset and get to fine tuning!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---So9yeBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s3639kyeh87rkwzi4cv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---So9yeBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s3639kyeh87rkwzi4cv.png" alt="create dataset" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Which Model?
&lt;/h3&gt;

&lt;p&gt;Before you start a fine tuning job, you need to pick which model you want to fine tune. Since this is my first time doing this, I can't give very specialised advice, but from my research: larger models generally have higher capacity and can perform better on a wider range of tasks, but require more computational resources and time for both fine-tuning and inference. Smaller models are more efficient but might not capture features as complex as larger ones can.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gbjugwae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mpaiffcekn1mlayvxwfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gbjugwae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mpaiffcekn1mlayvxwfy.png" alt="model choice" width="800" height="911"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I decided to go with Mistral 7B because my prompt is quite simple - it's a single task with a single function call response. I also want to speed up the pipeline. If the outcome with a smaller model is good enough, then I can avoid having to optimise the dataset or switch to a larger, slower model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Rk1xDmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/037rriz2qk5gfymiavwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Rk1xDmZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/037rriz2qk5gfymiavwa.png" alt="model choice" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few hours later:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O_RxYTX2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v37yplj2bzb7owdd8qaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O_RxYTX2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v37yplj2bzb7owdd8qaj.png" alt="deployed" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's automatically deployed!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xrFRb8YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zj8fkdv3k9gbs25c6ii4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xrFRb8YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zj8fkdv3k9gbs25c6ii4.png" alt="eval" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking this link takes you to a page where you can see a comparison between GPT-4's output and the fine tuned model's output on the test set (OpenPipe automatically uses 20% of your dataset for testing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p6hnhmWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wnph2ox1ip45vko85qe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p6hnhmWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wnph2ox1ip45vko85qe9.png" alt="eval table" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From a quick scan it looks promising...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--frB7S6zR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pwn3ueyxl1lsn1fnn6fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--frB7S6zR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pwn3ueyxl1lsn1fnn6fz.png" alt="meme" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the only way to know for sure is to play with it. I decided to run it on three test cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Video&lt;/th&gt;
&lt;th&gt;Tweets&lt;/th&gt;
&lt;th&gt;Relatedness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lex Fridman Podcast with Jeff Bezos&lt;/td&gt;
&lt;td&gt;Three tweets about LLM recommender systems and LLM data processing pipelines&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced RAG Tutorial&lt;/td&gt;
&lt;td&gt;1 tweet about AI therapists with advanced RAG&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Podcast about plants&lt;/td&gt;
&lt;td&gt;Three tweets about LLM recommender systems and LLM data processing pipelines&lt;/td&gt;
&lt;td&gt;Unrelated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Medium
&lt;/h3&gt;

&lt;p&gt;❌ 2 clips hallucinated that the user's tweets indicated interests in Amazon / startups.&lt;br&gt;
✅ 2 clips correctly linked moderately related clips to the user's tweets about data pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strong
&lt;/h3&gt;

&lt;p&gt;✅ 1 clip extracting the most relevant clip from the transcript&lt;/p&gt;

&lt;h3&gt;
  
  
  Unrelated
&lt;/h3&gt;

&lt;p&gt;❌ 1 clip hallucinating that user's tweets indicated interests in plants&lt;/p&gt;

&lt;p&gt;So there were definitely some performance decreases compared to GPT-4. I suspect this was largely due to my dataset - it contained very few cases where the transcript was unrelated to the tweets, because I was collecting data from the end of the pipeline, after it had already been through a bunch of filter steps.&lt;/p&gt;

&lt;p&gt;✨ Because these smaller models have weaker reasoning abilities than GPT-4, you need to make sure your training dataset covers all possible input cases.&lt;/p&gt;

&lt;p&gt;The fine tuned model is still absolutely usable, though, given its performance on strongly related transcripts. Additionally, since my pipeline has a re-ranking step to filter and order the clips after the create clips stage, any unwanted clips should get filtered out. Couple that with the 50x price decrease over GPT-4, and it's a no-brainer!&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;p&gt;We can actually extend the LLM startup playbook above. After collecting our dataset from GPT-4, we can improve its quality by filtering out or editing failed cases. A higher-quality dataset can even help us exceed GPT-4's output performance.&lt;/p&gt;

&lt;p&gt;I've been thinking about ways the dataset filtering and curation workflow could be improved. OpenPipe already lets you attach data to your request logs, which is useful for tagging each request with the name of the prompt. The OpenPipe team are also working on the ability to attach extra data to a request at a later time. That would let you use user feedback to filter down the dataset, making dataset curation a lot less work. &lt;/p&gt;

&lt;p&gt;For example, in Open Recommender, likes and dislikes could be used to filter the request log:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6xP_IaoB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r514ncjbt0sejqzsxy3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6xP_IaoB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r514ncjbt0sejqzsxy3o.png" alt="Image description" width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, OpenPipe supports a couple of cool additional features like GPT-4-powered LLM evals and token pruning. I'll do a walkthrough of those when I do more intensive fine tunes later. &lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;That's it for now! Thanks for reading. If you want to try out the Open Recommender Beta or have questions about how to use OpenPipe, please DM me on Twitter &lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Spaced Repetition Systems to Open Recommender Systems</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Sun, 31 Dec 2023 11:21:03 +0000</pubDate>
      <link>https://dev.to/experilearning/from-spaced-repetition-systems-to-open-recommender-systems-25ab</link>
      <guid>https://dev.to/experilearning/from-spaced-repetition-systems-to-open-recommender-systems-25ab</guid>
      <description>&lt;p&gt;In this piece, I want to connect YouTube Shorts and TikTok to &lt;a href="https://gwern.net/spaced-repetition"&gt;spaced repetition&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=oNCLLNZEtz0"&gt;incremental reading&lt;/a&gt; which I've worked on for the past 4 years.&lt;/p&gt;

&lt;p&gt;If you think about it, all of these systems fall into the general bucket of "recommender systems" - systems which attempt to predict and provide content that is most relevant to the user based on their past interactions.&lt;/p&gt;

&lt;p&gt;I made the connection while working at &lt;a href="https://remnote.com/"&gt;RemNote&lt;/a&gt; and thinking about ways to make the spaced repetition review experience more enjoyable - why do people kick back and relax at the end of a long day by watching four hours of YouTube shorts, but find 10-minute Anki review sessions to be a miserable chore? Is the only way to make flashcard queues more engaging to make the content more sensationalist and "trashy" similar to YouTube shorts and TikTok, or can you become addicted to educational content that helps you improve your life and make progress towards your goals?&lt;/p&gt;

&lt;p&gt;I want to explore the potential for a system that borrows the best elements from each to create something that feels as effortless and engaging as a queue of YouTube shorts but actually helps you make progress towards meaningful goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Recommendation System for Your Memory
&lt;/h2&gt;

&lt;p&gt;Spaced repetition systems (SRSs) are digital flashcard managers like &lt;a href="https://apps.ankiweb.net/"&gt;Anki&lt;/a&gt;, &lt;a href="https://supermemo.store/products/supermemo-19-for-windows"&gt;SuperMemo&lt;/a&gt; and &lt;a href="https://remnote.com/"&gt;RemNote&lt;/a&gt;. You can think of them as &lt;strong&gt;recommender systems for your memory&lt;/strong&gt; which direct your attention towards information you are about to forget.&lt;/p&gt;

&lt;p&gt;While they are &lt;a href="https://gwern.net/spaced-repetition"&gt;extremely effective&lt;/a&gt; at combatting forgetting, many users struggle to maintain consistent review habits because the system prioritises showing you information that you are about to forget over information that you would find interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Spaced Repetition Works
&lt;/h3&gt;

&lt;p&gt;Spaced repetition systems use algorithms which calculate the optimal review times and the fewest repetitions required to keep flashcards in your memory. The intervals between reviews expand with each repetition, allowing you to maintain your knowledge with exponentially lower effort over time.&lt;/p&gt;
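&lt;p&gt;To illustrate how quickly the intervals grow, here's a toy SM-2-style schedule where each successful review multiplies the interval by a fixed ease factor. The 2.5 ease and 1-day starting interval are illustrative; real schedulers like SM-18 and FSRS are far more sophisticated:&lt;/p&gt;

```typescript
// Sketch of expanding review intervals: each successful review multiplies
// the interval by an "ease" factor, so lifetime reviews stay small.
// The 2.5 ease and 1-day seed are illustrative (SM-2-style), not what
// modern schedulers actually use.
function reviewSchedule(ease: number, firstInterval: number, reps: number): number[] {
  const intervals: number[] = [];
  let interval = firstInterval;
  for (let i = 0; i < reps; i++) {
    intervals.push(Math.round(interval));
    interval *= ease;
  }
  return intervals;
}

// Ten successful reviews already push the next interval past ten years.
console.log(reviewSchedule(2.5, 1, 10));
// [1, 3, 6, 16, 39, 98, 244, 610, 1526, 3815] (days)
```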

&lt;p&gt;By controlling the rate of forgetting, you eliminate the churn effect in learning. Without spaced repetition, your knowledge grows asymptotically towards a saturation level as the rate of forgetting old knowledge balances the rate of learning new knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fzDjQySI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7w55pxu9rkbahr0qvfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fzDjQySI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l7w55pxu9rkbahr0qvfk.png" alt="no srs" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spaced repetition &lt;strong&gt;linearises the acquisition of knowledge&lt;/strong&gt;. By calculating optimal review dates and exponentially increasing review intervals, a single flashcard may be repeated as few as 6-20 times across your entire lifetime, meaning you may be able to remember as many as &lt;a href="https://supermemo.guru/wiki/How_much_knowledge_can_human_brain_hold"&gt;250-300 thousand flashcards by the time you retire&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fu3_bhrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t188pm8p2kdg17d0kwtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fu3_bhrK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t188pm8p2kdg17d0kwtc.png" alt="with srs" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Hasn't Spaced Repetition Taken Over the World
&lt;/h3&gt;

&lt;p&gt;Spaced repetition has proved itself to be an effective learning technique across a &lt;a href="https://www.reddit.com/r/Anki/comments/ubbdoc/anki_success_stories/"&gt;wide range of disciplines&lt;/a&gt;, from &lt;a href="https://www.youtube.com/watch?v=l3nd0cF-SOA"&gt;language learning&lt;/a&gt; to &lt;a href="https://www.reddit.com/r/medicalschoolanki/comments/hnkg7z/260_with_lightyear_deck/"&gt;medical school&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=_RdjsVngZz8"&gt;mathematics&lt;/a&gt; and &lt;a href="https://www.reddit.com/r/programming/comments/n30hl/janki_method_learning_programming_with_6000/"&gt;coding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I used Anki to learn Chinese to the point where I can comfortably read science fiction like &lt;a href="https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)"&gt;The Three Body Problem&lt;/a&gt; without a dictionary. I used SuperMemo to learn coding from scratch with no technical background which then became my full-time job. And I used RemNote to learn math, later writing &lt;a href="https://x.com/experilearning/status/1694642965140934913?s=20"&gt;an intro to logic and proof course using an interactive theorem prover and RemNote&lt;/a&gt; as someone who hadn't studied math since I was 16.&lt;/p&gt;

&lt;p&gt;Spaced repetition can change your life and &lt;a href="https://scalingknowledge.substack.com/p/spaced-repetition-for-knowledge-creation"&gt;it has the potential to change the world&lt;/a&gt;, but it's rare to see it used outside of cramming for exams (language learning being the major exception). Students endure the spaced repetition algorithm until they graduate and feel a great sense of relief when they can finally delete their flashcard decks for good. Burnout is also common, with many users feeling crushed by their daily repetitions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Misaligned Objective Function
&lt;/h4&gt;

&lt;p&gt;The objective function of the spaced repetition algorithm is to maximise the user's memory retention with the fewest number of repetitions. I've argued before that this is often out of line with users' real-world goals and makes it hard to enjoy the review experience.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1476655088999538699-151" src="https://platform.twitter.com/embed/Tweet.html?id=1476655088999538699"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;Because the goal is to maximise memory retention, &lt;strong&gt;the bulk of your reviews will be focused on the flashcards that are most difficult for you&lt;/strong&gt;. From the perspective of the spaced repetition algorithm, it would be considered a waste of time to review flashcards that you easily remember because if you can easily remember them, the system can confidently boost the review interval far into the future, satisfying its goal of minimising your repetition load. Instead, it's considered better to show you the cards you find most difficult more often so you can burn them into your memory.&lt;/p&gt;

&lt;p&gt;As a result, reviewing your flashcard queue becomes a punishing experience where you are predominantly reviewing information you don't understand very well. Avoiding this requires quite a lot of attention and care, obsessing over &lt;a href="https://controlaltbackspace.org/precise/"&gt;flashcard formulation&lt;/a&gt;, eliminating &lt;a href="https://docs.ankiweb.net/leeches.html"&gt;leeches&lt;/a&gt; and aiming for a consistent &lt;a href="https://supermemo.guru/wiki/Forgetting_index_in_SuperMemo#:~:text=If%20the%20forgetting%20index%20is%20too%20high%2C%20your%20repetitions%20will%20be%20stressful%20due%20to%20constant%20problems%20with%20recall."&gt;90% retention rate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Eg-__4KO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xunzmk5a0fnknm909pp.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Eg-__4KO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xunzmk5a0fnknm909pp.gif" alt="Image description" width="250" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's another way to think about it: in a pure spaced repetition system &lt;strong&gt;flashcards outcompete other flashcards for your attention by being more difficult to remember&lt;/strong&gt;. Wouldn't we be happier if our flashcards competed for our attention by being &lt;em&gt;more interesting&lt;/em&gt; as opposed to more difficult? I realised that this is effectively how recommender systems like YouTube and TikTok work - they are markets of video clips competing to be as interesting and engaging as possible for users. This thought is what got me interested in studying them more closely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incremental Reading
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=oNCLLNZEtz0"&gt;Incremental reading&lt;/a&gt; systems like &lt;a href="https://supermemo.guru/wiki/Incremental_reading"&gt;SuperMemo&lt;/a&gt; are a superset of spaced repetition systems. Incremental reading involves interleaving your flashcards with snippets from articles, books and videos in a big prioritised queue.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The general idea is to form a "funnel of knowledge" in which a vast worldwide web is converted into a "selection of own material", that moves to important highlights (extracts), that get converted to active knowledge (cloze deletion), which is then made stable in memory, and, in the last step, acted upon in a creative manner in problem solving. - &lt;a href="https://supermemo.guru/wiki/Incremental_reading#:~:text=The%20general%20idea,problem%20solving."&gt;Incremental Reading&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/oNCLLNZEtz0?start=131"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;It's interesting to analyse them through the lens of recommender systems and compare them to pure spaced repetition systems like Anki because &lt;strong&gt;incremental reading systems do a much better job at allowing the user to enjoy the review process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Between 2020 and 2022 I spent a lot of time using SuperMemo, read all of &lt;a href="https://supermemo.guru/"&gt;Piotr Wozniak's articles&lt;/a&gt;, talked to hundreds of people in the &lt;a href="https://discord.com/invite/vUQhqCT"&gt;SuperMemo.Wiki Discord channel&lt;/a&gt; and started a &lt;a href="https://www.youtube.com/channel/UCIaS9XDdQkvIjASBfgim1Uw"&gt;YouTube channel&lt;/a&gt; and &lt;a href="https://www.youtube.com/channel/UC9PA26yTZOsJB_wHJXN6sKg"&gt;podcast&lt;/a&gt; related to these ideas.&lt;/p&gt;

&lt;p&gt;I met &lt;a href="https://www.youtube.com/watch?v=0VvSj2hHGk4"&gt;ex-gamers who went from spending 8 hours per day playing League of Legends&lt;/a&gt; to live-streaming themselves reviewing their incremental reading queue for 8 hours per day. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z6ZACFHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kds6bvp2negdo5y7opat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z6ZACFHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kds6bvp2negdo5y7opat.png" alt="quit gaming for srs" width="798" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ksdzQxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbox4460h7kzle0t6hto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ksdzQxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbox4460h7kzle0t6hto.png" alt="learning addiction" width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many people were not in school (myself included) or had never been to school at all! We didn't have exams to study for. But instead of playing video games, we spent all day using this obscure learning software with a UI like a NASA rocket dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y_EY3Nji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xhhn6auzvgshny50od2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y_EY3Nji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xhhn6auzvgshny50od2q.png" alt="Image description" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of watching Netflix, we had month-long debates over &lt;a href="https://www.youtube.com/watch?v=Aul2gX0j5Oo"&gt;whether flashcards ought to be formulated using analogies&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=OwV5HPKMrbg"&gt;made guides about how you should prioritise and value learning material&lt;/a&gt;. Why?&lt;/p&gt;

&lt;p&gt;The only way to explain it is by breaking down the concept of knowledge valuation.&lt;/p&gt;
&lt;h3&gt;
  
  
  A Nose for the Interesting
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/pLFwsGL0sX0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Humans are naturally attracted to information that is surprising, novel, consistent and coherent. Sometimes we are attracted to contradictory information - I think it depends on your personality type. But for the most part, we love information that "slots in" or that we can relate in some way to our prior knowledge and our current goals. &lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=eAnNGqwI2AQ"&gt;Understanding is pleasurable&lt;/a&gt;&lt;/strong&gt;. We don't like information that we can't relate in some way to our prior knowledge. When we are forced to try to understand something when we don't know the pre-requisite concepts, we get bored and frustrated and start yawning.&lt;/p&gt;

&lt;h4&gt;
  
  
  Semantic Distance
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q0ukKiIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k901eipw5b0jni67j1l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q0ukKiIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k901eipw5b0jni67j1l6.png" alt="semantic distance" width="362" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Semantic distance is like the "knowledge gap" between what you already know and some new piece of information. For example, there's a semantic distance between your current knowledge of mathematics and a new mathematical subject you haven't studied before. When the gap is too large, it's impossible to understand the new information because you can't relate it meaningfully to what you already know. If you sit through a graduate-level math lecture before having studied math 101, you won't understand a thing, and you'll display physical signs of displeasure like fidgeting, restlessness and yawning - your body is telling you to go do something else!&lt;/p&gt;
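&lt;p&gt;If you wanted to approximate semantic distance computationally, one crude proxy is the cosine distance between embedding vectors. The 3-dimensional vectors below are toy values; a real system would get them from an embedding model:&lt;/p&gt;

```typescript
// Sketch: cosine distance between embedding vectors as one crude proxy for
// "semantic distance". The 3-d vectors are toy values, not real embeddings.
function cosineDistance(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return 1 - dot / (norm(a) * norm(b));
}

const math101 = [0.9, 0.1, 0.0];
const calculus = [0.8, 0.3, 0.1];
const gradSeminar = [0.1, 0.2, 0.95];

// The adjacent topic sits at a much smaller distance than the far-off one.
console.log(cosineDistance(math101, calculus) < cosineDistance(math101, gradSeminar)); // true
```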

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8pvYmSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjz226xzc6ea2328w6v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8pvYmSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gjz226xzc6ea2328w6v7.png" alt="Image description" width="800" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anyone who has used spaced repetition systems at school understands the feeling of needing to drill flashcards into your brain that just won't stick and it's a horrible feeling.&lt;/p&gt;

&lt;p&gt;There are two main features in incremental reading systems which avoid this, making them much more enjoyable to use: prioritisation and variable reward.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Priority Queue
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cGOeTjvF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nat62spnwb36fc2s4j4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGOeTjvF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nat62spnwb36fc2s4j4h.png" alt="Image description" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Incremental reading systems add prioritisation to the queue, so unlike in a pure spaced repetition system, you aren't shown information based only on how likely you are to forget it. You can use the priority system to attach a subjective sense of importance or "interestingness" to each item in the queue. The priority is just a number from 0-100; what exactly it means is up to the user to determine, and it will likely vary over time as their values and goals evolve.&lt;/p&gt;
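&lt;p&gt;As a rough sketch, a queue like this can be ordered by blending the 0-100 priority with how overdue each item is. The weighting below is purely illustrative - it is not SuperMemo's actual scheduling algorithm:&lt;/p&gt;

```typescript
// Sketch: order a review queue by blending overdue-ness with a 0-100
// priority (0 = most important). The scoring is illustrative - SuperMemo's
// real scheduling is considerably more involved.
type QueueItem = { title: string; priority: number; daysOverdue: number };

function sortQueue(items: QueueItem[]): QueueItem[] {
  // Lower score = shown sooner. Priority dominates; overdue-ness is capped
  // so a long-neglected low-value item can't swamp the queue.
  const score = (it: QueueItem) => it.priority - Math.min(it.daysOverdue, 50);
  return [...items].sort((a, b) => score(a) - score(b));
}

const queue = sortQueue([
  { title: "Boring article", priority: 90, daysOverdue: 10 },
  { title: "Key flashcard", priority: 5, daysOverdue: 25 },
  { title: "Interesting video", priority: 20, daysOverdue: 30 },
]);

console.log(queue.map((it) => it.title));
// ["Key flashcard", "Interesting video", "Boring article"]
```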

&lt;p&gt;You are also encouraged to break up large pieces of content into smaller chunks and apply more granular prioritisation mechanisms. This is very similar to the way people chop up clips from long podcasts and upload the highlights to YouTube shorts or TikTok.&lt;/p&gt;

&lt;p&gt;This is an improvement over pure spaced repetition because &lt;strong&gt;flashcards no longer compete for your attention based only on how difficult they are, but also on how interesting you find them&lt;/strong&gt;. Since &lt;a href="https://www.youtube.com/watch?v=eAnNGqwI2AQ"&gt;understanding is pleasurable&lt;/a&gt;, people naturally assign the highest priorities to material that is novel but comprehensible. Any material that is too complex gets sent to the back of the queue. By the time you reach it, you may have built up the pre-requisite concepts for it to become interesting to you. &lt;/p&gt;

&lt;p&gt;This works well as long as you are diligent about updating priorities for each flashcard or article in your collection to maintain the mapping between how interesting your brain finds them and their order in the priority queue. But what inevitably happens is that you take a break from the system, your interests change and the priorities you assigned 6 months ago are now out of sync with your current interests. Users end up complaining about "stale collections" of material they used to find interesting, but which they no longer care about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8MP66Wtz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/per3vkt82x5a23am6s4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8MP66Wtz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/per3vkt82x5a23am6s4r.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It would be better if the system could quickly and dynamically adjust what it recommends to you by inferring your interests from your current behaviour, similar to modern recommender algorithms.&lt;/p&gt;
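&lt;p&gt;One simple way to model this is an interest profile whose topic weights decay exponentially unless recent behaviour reinforces them, so stale interests fade without any manual re-prioritisation. The decay rate and topic names below are illustrative:&lt;/p&gt;

```typescript
// Sketch: infer current interests by exponentially decaying topic weights
// toward recent behaviour, so stale priorities fade without manual re-sorting.
// The 0.9 decay rate and topic strings are illustrative.
function updateInterests(
  profile: Map<string, number>,
  recentTopics: string[],
  decay = 0.9
): Map<string, number> {
  const updated = new Map<string, number>();
  // Every existing interest fades a little each update...
  for (const [topic, weight] of profile) updated.set(topic, weight * decay);
  // ...while topics the user engaged with today get reinforced.
  for (const topic of recentTopics) {
    updated.set(topic, (updated.get(topic) ?? 0) + 1);
  }
  return updated;
}

let profile = new Map([["plants", 5]]);
// A month of LLM-related activity, no plant content:
for (let day = 0; day < 30; day++) profile = updateInterests(profile, ["llms"]);

// "llms" now far outweighs the stale "plants" interest.
console.log((profile.get("llms") ?? 0) > (profile.get("plants") ?? 0)); // true
```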

&lt;h3&gt;
  
  
  Variable Reward
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kWNoQEkV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/36oy31nfi97i99hnjvp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kWNoQEkV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/36oy31nfi97i99hnjvp2.png" alt="variable reward in IR" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just having a mixture of different information types in your queue makes it much more enjoyable. It can be draining to read or answer flashcards for hours on end - why not break it up with some more passive sources like YouTube videos?&lt;/p&gt;

&lt;p&gt;Incremental reading systems differ from recommender systems in that you are responsible for manually adding all the material to your queue. But as the intervals between repetitions and the number of items in your collection grow, you tend to forget what you put in your queue, so you can't predict what's coming next. All you know is that whatever comes next is something you believed was important enough to import and assign a priority. This makes going through your queue surprising. Articles and videos interleaved with flashcards force seemingly unrelated concepts into your attention in quick succession, often resulting in unexpected connections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fbW-lZCu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fflh8vw7lalm20ql534q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fbW-lZCu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fflh8vw7lalm20ql534q.png" alt="Image description" width="800" height="715"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But current incremental reading systems aren't set up to give you completely unexpected content - they rely on you manually going out to search for things. Nor is there any notion of collaborative filtering - incremental reading systems are single-player games. And due to information overload and the degrading quality of Google search results, this is becoming more and more frustrating. &lt;/p&gt;

&lt;h2&gt;
  
  
  YouTube Shorts
&lt;/h2&gt;

&lt;p&gt;Now let's examine YouTube shorts. Why is it that people are able to kick back and relax at the end of a long day by watching four hours of YouTube shorts, but would despise spending four hours doing flashcard repetitions in Anki? &lt;/p&gt;

&lt;p&gt;Why do people willingly forego meals and sleep to watch TikTok but only do Anki repetitions reluctantly?&lt;/p&gt;

&lt;p&gt;For spaced repetition apps (and ed-tech apps in general) to take over the world, they need to see themselves as competing with Netflix, TikTok and YouTube shorts for users' attention. When you scroll through TikTok or YouTube shorts, you never have the experience of failing to understand something. Everything is perfectly comprehensible. There is also the novelty and surprise of not knowing what is coming next. YouTube shorts is a market of clips competing to entertain you the most. Clips are collaboratively filtered by the community and recommended to you based on your watch history. They have high retention because they let you quickly "channel zap" and find something new when you get bored.&lt;/p&gt;

&lt;p&gt;But while the user interface and algorithms implemented into YouTube shorts and TikTok make them very engaging, the content quality is often extremely bad. It's hard to curate a feed of really high quality educational content that will advance you towards your goals. These systems are black boxes with little opportunity for customisation. They also lack the active recall aspect of spaced repetition systems which help you internalise and reflect on important concepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Better Recommender System?
&lt;/h2&gt;

&lt;p&gt;Is there potential for a recommender system that blends the best characteristics from spaced repetition, incremental reading and YouTube shorts into one? People will argue that without sensationalist, clickbait content, YouTube shorts and TikTok would lose their addictive power, but my experience in the SuperMemo.Wiki Discord has shown me that people can become hooked on content that improves their life under the right circumstances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wvh_88w_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l5u7wvs7jhoj6ubx3zgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wvh_88w_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l5u7wvs7jhoj6ubx3zgj.png" alt="Image description" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I dreamt of back in university was a system that could infer my interests from my daily activities - flashcard reviews, reading behaviour and browsing habits - and use that information to recommend me videos, articles and podcasts from YouTube that I could watch in the evening after school. I never got anything off the ground until a month ago, when I revisited &lt;a href="https://erik.bjareholt.com/wiki/importance-of-open-recommendation-systems/"&gt;Erik Bjäreholt's blog post on Open Recommender Systems&lt;/a&gt; and realised that LLMs have made sophisticated, customisable and explainable recommendation systems easier than ever to build!&lt;/p&gt;

&lt;p&gt;I think huge advances have suddenly become possible, and we need to start exploring what can be done now that we can use LLMs as agents, constantly scanning the web on our behalf, searching for golden nuggets we'll find interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Recommender
&lt;/h3&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9-wwsHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/OpenPipe"&gt;
        OpenPipe
      &lt;/a&gt; / &lt;a href="https://github.com/OpenPipe/open-recommender"&gt;
        open-recommender
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Using LLMs to implement an open source YouTube video recommendation system.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
    &lt;a rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/bjsi/open-recommender/main/img/logo.webp"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1mOFz53e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/bjsi/open-recommender/main/img/logo.webp" alt="Open Recommender Logo" height="200"&gt;&lt;/a&gt;
    &lt;br&gt;
    Open Recommender - An open source, LLM-powered YouTube video recommendation system
&lt;/h1&gt;
&lt;h3&gt;
⚠️ Work in Progress... ⚠️
&lt;/h3&gt;
&lt;ul class="contains-task-list"&gt;
&lt;li class="task-list-item"&gt;
 Build an MVP of the data pipeline&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Iterate on the prompts until 8/10 recommendations are good&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Curate fine tune dataset&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Create a website so people can sign up&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Collect more fine tune data&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Fine tune using &lt;a href="https://openpipe.ai/" rel="nofollow"&gt;OpenPipe&lt;/a&gt;
&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Scale to more users&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Add more recommendation sources (e.g. articles, tweets, books, etc.)&lt;/li&gt;
&lt;li class="task-list-item"&gt;
 Scale to millions of users&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;h2&gt;
🚀 Overview&lt;/h2&gt;
&lt;p&gt;Welcome to Open Recommender, an open source recommendation system for YouTube. &lt;em&gt;See the video intro below&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=KbBwhuVpqC0" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/97940e08fab9c4d43110e065e2a1484d75be6e66ff8c4bf3f3f89bccd8bc7e09/68747470733a2f2f696d672e796f75747562652e636f6d2f76692f4b624277687556707143302f687164656661756c742e6a7067" alt="Video"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
🏆 Goals&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Understand your current interests by analyzing your Twitter engagement (likes, retweets, etc.)&lt;/li&gt;
&lt;li&gt;Create a large database of potentially interesting videos&lt;/li&gt;
&lt;li&gt;Recommend interesting sections from videos&lt;/li&gt;
&lt;li&gt;Recommend "timeless" content rather than "trending" content&lt;/li&gt;
&lt;li&gt;Surface "niche bangers" - difficult to find but really high quality content&lt;/li&gt;
&lt;li&gt;Biased towards learning as opposed to entertainment&lt;/li&gt;
&lt;li&gt;Read my blog post for more: &lt;a href="https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg" rel="nofollow"&gt;Building an&lt;/a&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/OpenPipe/open-recommender"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;For the past month I've been working on a proof of concept for an open source recommender system called Open Recommender. Currently it takes your public Twitter data as input, uses LLMs to process it and infer your interests, and searches YouTube to curate lists of short 30-60 second clips tailored to your current interests. I wrote &lt;a href="https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg"&gt;an article about it here&lt;/a&gt; and made &lt;a href="https://www.youtube.com/watch?v=KbBwhuVpqC0"&gt;a video showing an early prototype&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I already have plans to make it even more targeted at an individual's unique interests and prior knowledge - for example, it could take arbitrary information sources as input, like the history and probability of recall for all of the information you have added to your spaced repetition system.&lt;/p&gt;

&lt;p&gt;I'll also be adding ways to tune the behaviour of the recommender to your personal tastes - the recommendations could sit on a spectrum from content matching interests the system already knows you have to recommendations from new, unexplored territory. And because it's implemented using LLMs, you can provide custom instructions in natural language and the LLM can give understandable explanations for its recommendations.&lt;/p&gt;

&lt;p&gt;And I'm also working on an early version of the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CiiGWf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kowv6ma07ulet3i6qxpi.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CiiGWf-u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kowv6ma07ulet3i6qxpi.gif" alt="Image description" width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this sounds exciting to you and you are interested in supporting the project, you can &lt;a href="https://buy.stripe.com/bIY7tbco90f23sY9AC"&gt;subscribe here&lt;/a&gt; and I'll include you in the test of the UI before it becomes available to anyone else, as well as implement your feedback ASAP. If you have any good ideas, please get in touch with me on Twitter &lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;. Thanks for reading!!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>openai</category>
    </item>
    <item>
      <title>Avoiding Cascading Failure in LLM Prompt Chains</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Fri, 29 Dec 2023 14:06:28 +0000</pubDate>
      <link>https://dev.to/experilearning/avoiding-cascading-failure-in-llm-prompt-chains-9bf</link>
      <guid>https://dev.to/experilearning/avoiding-cascading-failure-in-llm-prompt-chains-9bf</guid>
      <description>&lt;p&gt;A common problem faced when building LLM applications composed of chains of prompts is that failures and inaccuracies early on in the chain get compounded into system-wide failures. It's like the &lt;a href="https://en.wikipedia.org/wiki/Cascading_failure" rel="noopener noreferrer"&gt;cascading failure problem&lt;/a&gt; where a failure in a small subset of nodes propagates outwards bringing down the entire network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7xv2y5eh2se5pzqi10y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7xv2y5eh2se5pzqi10y.gif" alt="cascading failure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I noticed this a lot while working on &lt;a href="https://github.com/OpenPipe/open-recommender" rel="noopener noreferrer"&gt;Open Recommender&lt;/a&gt;, an open source YouTube video recommender system which takes users' Twitter feeds as input, infers the kind of topics they are interested in and searches YouTube to find relevant YouTube videos and clips to recommend them.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KbBwhuVpqC0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;The beginning of the data processing pipeline looks like this. The main part I'll discuss in this article is the &lt;code&gt;createQueries&lt;/code&gt; step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validateArgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getTweets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// generates YouTube queries based on tweets&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;createQueries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;searchForVideos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// ... more stages&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I run &lt;code&gt;pipeline.execute&lt;/code&gt;, each stage gets executed sequentially and the output of the previous stage is passed as input to the next. The &lt;code&gt;getTweets&lt;/code&gt; stage outputs a list of a user's last 30 tweets. These get passed to the &lt;code&gt;createQueries&lt;/code&gt; stage which constructs a list of YouTube search queries. Those search queries are then passed to the &lt;code&gt;searchForVideos&lt;/code&gt; stage which searches YouTube and returns a list of search results for each query.&lt;/p&gt;
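The pipeline abstraction itself can be sketched in a few lines. This is a hypothetical simplification rather than the actual Open Recommender implementation: each stage is just an async function whose output type must match the next stage's input type.

```typescript
// Hypothetical minimal version of the Pipeline class: each stage is an
// async function, and addStage composes it onto the previous stages.
type Stage<I, O> = (input: I) => Promise<O>;

class Pipeline<I, O> {
  constructor(private run: Stage<I, O>) {}

  addStage<N>(stage: Stage<O, N>): Pipeline<I, N> {
    // The previous stage's output is passed as input to the next stage.
    return new Pipeline(async (input: I) => stage(await this.run(input)));
  }

  execute(input: I): Promise<O> {
    return this.run(input);
  }
}

// Toy usage standing in for getTweets -> createQueries -> searchForVideos
const pipeline = new Pipeline(async (user: string) => [`tweet by ${user}`])
  .addStage(async (tweets) => tweets.map((t) => `query for ${t}`))
  .addStage(async (queries) => queries.length);
```

Typing each stage this way means a stage that produces the wrong shape fails at compile time instead of halfway through a long run.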

&lt;h2&gt;
  
  
  Chinese Whispers
&lt;/h2&gt;

&lt;p&gt;The problem is that LLM prompt chains are like a game of Chinese whispers - without building error recovery mechanisms into your program, errors compound into stranger and stranger outputs.&lt;/p&gt;

&lt;p&gt;I kept running into issues where the &lt;code&gt;createQueries&lt;/code&gt; function was both a strong determinant of the quality of the final recommendations and very difficult to get working reliably.&lt;/p&gt;

&lt;p&gt;Constructing really effective search queries is inherently a difficult problem because it requires a great deal of knowledge about the user - it's not enough to know that the user has tweeted about a particular topic, you also need to infer the user's expertise level and whether it's a passing interest or something they really care about.&lt;/p&gt;

&lt;p&gt;Initially my approach in the &lt;code&gt;createQueries&lt;/code&gt; stage was to run the user's tweets through a prompt called &lt;code&gt;inferInterests&lt;/code&gt;. The idea was to extract an array of topics (concepts, people, events and problems) the user was interested in and use those to construct search queries. But this felt like quite a one-dimensional compression of the user's interests and erased a lot of nuance about what the user was expressing on each topic.&lt;/p&gt;

&lt;p&gt;This meant that the quality of the &lt;code&gt;createQueries&lt;/code&gt; output could range from great to very poor, and as many as half of the recommended videos presented to the user at the end of the pipeline felt irrelevant.&lt;/p&gt;

&lt;p&gt;It was difficult to build in error recovery mechanisms too, because if I added a step to compare the video search results against the queries, they would look reasonable, but comparing them against the tweets made it clear that a lot of results were missing the mark in terms of relevancy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;My first realisation was that reducing a user's tweets to a list of topics, people, events and problems is an extremely lossy compression of their interests. And strong lossy compression does not allow stages later in the pipeline to recover effectively from errors.&lt;/p&gt;

&lt;p&gt;For that reason I removed the intermediate &lt;code&gt;inferInterests&lt;/code&gt; step and instead generate queries directly from the user's tweets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;CreateYouTubeSearchQueries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tweet&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nl"&gt;tweetIDs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="p"&gt;}[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that in the return type I ask GPT to include the IDs of the tweets it used to generate each search query. Later in the pipeline I use these tweets to double-check that the outputs of subsequent stages are still relevant. For example, in the signature of the &lt;code&gt;filterSearchResults&lt;/code&gt; prompt, you can see that it takes arrays of tweets and search results as input and returns an array of search results with relevancy scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;FilterSearchResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tweet&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="nl"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way I'm controlling the compounding of errors by comparing against the "ground truth" of users' tweets.&lt;/p&gt;

&lt;p&gt;Additionally, by adding a simple relevancy score to the prompt output schema, I can filter out bad search results by setting a relevancy cutoff value.&lt;/p&gt;
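As a sketch, the cutoff step is just a threshold applied over the scored results. The &lt;code&gt;ScoredResult&lt;/code&gt; shape and the 0.7 cutoff here are illustrative assumptions, not the actual values used:

```typescript
// Illustrative shape for a scored search result; the cutoff is an assumption.
interface ScoredResult {
  result: { title: string };
  relevance: number; // e.g. 0 (irrelevant) to 1 (highly relevant)
}

function filterByRelevance(
  scored: ScoredResult[],
  cutoff = 0.7
): ScoredResult[] {
  // Drop any search result the LLM scored below the cutoff.
  return scored.filter((s) => s.relevance >= cutoff);
}
```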

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj6adi3y93frint6x7ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj6adi3y93frint6x7ba.png" alt="Something"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally at the end of the pipeline, I added a more expensive filtering and ranking step inspired by &lt;a href="https://github.com/sunnweiwei/RankGPT" rel="noopener noreferrer"&gt;RankGPT&lt;/a&gt; to do a final ordering over the remaining video clips, picking only the top 10-15 to recommend to the user. &lt;/p&gt;

&lt;h2&gt;
  
  
  Core Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's best to carry the "ground truth" data for your prompt through the pipeline rather than relying on a lossy compressed summary of it.&lt;/li&gt;
&lt;li&gt;Employ other mechanisms like filtering and re-ranking to minimise the effect of errors built up earlier in the pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;DM me on Twitter (&lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;) if you want to try the current version of Open Recommender! I'll run it on your Twitter data and send you the results. Check out the &lt;a href="https://dev.to/experilearning/open-recommender-beta-3ie4"&gt;beta roadmap&lt;/a&gt; to see what will be available over the next month or so.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Open Recommender Beta Roadmap</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Fri, 29 Dec 2023 07:31:58 +0000</pubDate>
      <link>https://dev.to/experilearning/open-recommender-beta-3ie4</link>
      <guid>https://dev.to/experilearning/open-recommender-beta-3ie4</guid>
      <description>&lt;p&gt;Over the past month I have been working on &lt;a href="https://github.com/OpenPipe/open-recommender"&gt;Open Recommender&lt;/a&gt;, an open source YouTube video recommendation system which takes your Twitter feed as input and recommends YouTube-shorts style clips tailored to your interests. I made a &lt;a href="https://www.youtube.com/watch?v=KbBwhuVpqC0"&gt;video about it&lt;/a&gt; if you want a more in-depth introduction.&lt;/p&gt;

&lt;p&gt;In this article I want to quickly share my plan for the next month to polish Open Recommender to a beta state. If you prefer, there is a 3 minute video breakdown of the roadmap below; otherwise feel free to skim the rest of the article.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/9Km1pVgWywE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Tasks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Get Users
&lt;/h3&gt;

&lt;p&gt;Open Recommender Alpha will be me DMing people on Twitter, running the pipeline on their data, sending them recommendations and asking for feedback. The GPT-4 request logs will get auto saved into &lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt; for fine tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine Tune
&lt;/h3&gt;

&lt;p&gt;I will fine tune using a dataset collected from the Open Recommender Alpha. I will also use data collected from running the pipeline over slices of my own twitter data. I will use &lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt; to do the fine tuning.&lt;/p&gt;

&lt;p&gt;I won't bother with user feedback signals or manual dataset filtering and augmentation at this stage - just raw GPT-4 ⇒ Mistral / Llama 7B. The point is simply to bring the cost down.&lt;/p&gt;

&lt;p&gt;Once the fine tune is done, we can test the fine tuned model's performance against GPT-4 to check for performance degradation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build UI
&lt;/h3&gt;

&lt;p&gt;I will implement a basic YouTube shorts style UI supporting both mobile and web. I don't think I need to bother with auth yet. Open Recommender only uses public data at the moment, so a user's recommendations can just be a public URL that gets DM'd/emailed to them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lower Priority
&lt;/h3&gt;

&lt;p&gt;Maybe I'll get round to these.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hHFActrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u7rgci6mmueov7uhdzp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hHFActrL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u7rgci6mmueov7uhdzp0.png" alt="pipeline" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;filterShitposts&lt;/code&gt; prompt step to filter out tweets which aren't relevant for making recommendations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make a PR to OpenPipe to support adding user feedback to the request logs to make dataset filtering easier.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;DM me on Twitter (&lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;) if you want to try the current version of Open Recommender :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Building an Open Source LLM Recommender System: Prompt Iteration and Refinement</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Thu, 28 Dec 2023 08:45:29 +0000</pubDate>
      <link>https://dev.to/experilearning/building-an-open-source-llm-recommender-system-prompt-iteration-and-refinement-7b4</link>
      <guid>https://dev.to/experilearning/building-an-open-source-llm-recommender-system-prompt-iteration-and-refinement-7b4</guid>
      <description>&lt;p&gt;Over the past month I have been working on &lt;a href="https://github.com/bjsi/open-recommender"&gt;Open Recommender&lt;/a&gt;, an open source YouTube video recommendation system which takes your Twitter feed as input and recommends YouTube-shorts style clips tailored to your interests. I made a &lt;a href="https://www.youtube.com/watch?v=KbBwhuVpqC0"&gt;video about it&lt;/a&gt; if you want a more in-depth introduction.&lt;/p&gt;

&lt;p&gt;The data pipeline looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pDFd9Kh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrfnlcv9osacamdl6axf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pDFd9Kh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrfnlcv9osacamdl6axf.png" alt="Image description" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So based on analysing your Twitter feed, the system generates YouTube search queries and uses the search API to find relevant videos. It then chops the videos up into clips. All of this data processing is controlled using LLMs, currently GPT-4, but over the next couple of weeks I'm going to be migrating away from OpenAI's expensive closed-source APIs towards fine tuned open source models using &lt;a href="https://openpipe.ai/"&gt;OpenPipe&lt;/a&gt;, a brilliant service for incrementally replacing OpenAI's models with smaller, faster, cheaper fine tuned open source models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Iteration
&lt;/h2&gt;

&lt;p&gt;The main focus over the past couple of weeks has been tweaking and improving the reliability of the prompts and data processing pipeline to the point where 8/10 of the recommendations feel interesting. When I started, only half of the recommendations felt relevant, which was quite encouraging because I knew from previous projects that as long as you have a decent LLM program, with enough tweaking it's possible to turn it into something great. I'm happy to report that after many hours banging my head against the wall I have finally achieved the 8/10 quality recommendations goal consistently across runs, at least for my own Twitter data. Here are some of the key things I learned over the past couple of weeks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Tools for Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;We need better tools for prompt engineering. Ideally prompts should be written and auto optimised by an LLM with optional human in the loop feedback. I frequently ran into issues where my approach wasn't working, but I didn't have the energy to try something else because it would take too much time without any guarantee that it would perform better. Just like in programming, you want the experimentation cycle to be short so you can quickly filter through possible solutions to find something that works. But this just isn't possible with prompt engineering right now. It takes a huge amount of time to set up alternative prompts, in-context examples or re-jig your prompt chain to quickly experiment with a different approach. Minimising friction here is essential.&lt;/p&gt;

&lt;p&gt;Based on my experience here I started working on a TypeScript library called &lt;a href="https://github.com/bjsi/prompt-iteration-assistant"&gt;Prompt Iteration Assistant&lt;/a&gt;. I described it as "a set of simple tools to speed up the prompt engineering iteration cycle". It gives you a nice CLI dialog for creating, testing and iterating on prompts. To create a new prompt, you tell it the goal and ideal output from the prompt and it bootstraps a new prompt by getting GPT-4 to write it. It infers the input and output schemas and will support code generation to add the prompt to your codebase automatically.&lt;/p&gt;

&lt;p&gt;My goal is to make prompt engineering 10x easier, but it definitely hasn't reached that level yet. I think the DX is nice because the CLI dialogs and code generation makes writing prompts a lot faster, but I don't think this represents the next generation of prompt engineering yet.&lt;/p&gt;

&lt;p&gt;A couple of days ago I ran into a really impressive project called DSPy which supports auto generating and optimising whole programs composed of multiple prompts. To quote the docs: "DSPy gives you general-purpose modules (e.g., ChainOfThought) and takes care of optimising prompts for your program and your metric."&lt;/p&gt;

&lt;p&gt;Please see my article specifically on "Better Tools for Prompt Engineering" where I go into more detail about these topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimise the Main Levers and Avoid Cascading Failure
&lt;/h3&gt;

&lt;p&gt;I realised that the &lt;code&gt;createQueries&lt;/code&gt; and &lt;code&gt;createClips&lt;/code&gt; prompts are the two stages in the pipeline that make the biggest impact on the quality of the recommendations. &lt;code&gt;createQueries&lt;/code&gt; controls which queries get sent to the YouTube search API and &lt;code&gt;createClips&lt;/code&gt; controls whether and how each video gets split up into YouTube-shorts style clips.&lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;createClips&lt;/code&gt; function, I was able to improve the quality of the output using traditional prompt engineering techniques. I kept tweaking the prompt and evaluating it against 3 datasets - a related transcript, a moderately related transcript and a completely unrelated transcript - to validate that I got the expected output from each one. But I wasn't able to guarantee the quality of the clips. To make the quality more reliable I implemented a re-ranking prompt for video clips (inspired by &lt;a href="https://github.com/sunnweiwei/RankGPT"&gt;RankGPT&lt;/a&gt;) to make sure only the best of the best gets recommended. I also added some logic to limit the number of recommendations from the same source to make sure there is enough variety in the final recommended clips.&lt;/p&gt;
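The source-variety logic can be sketched as a simple per-video cap applied after ranking. The &lt;code&gt;Clip&lt;/code&gt; shape and the cap of 2 clips per video are my own illustrative assumptions, not the project's actual values:

```typescript
// Illustrative sketch: keep at most `maxPerVideo` clips from any one
// source video, preserving the existing ranking order.
interface Clip {
  videoId: string;
  title: string;
}

function diversifyBySource(rankedClips: Clip[], maxPerVideo = 2): Clip[] {
  const counts = new Map<string, number>();
  return rankedClips.filter((clip) => {
    const n = counts.get(clip.videoId) ?? 0;
    if (n >= maxPerVideo) return false; // this source is already well represented
    counts.set(clip.videoId, n + 1);
    return true;
  });
}
```

Running the cap after the re-ranking step means the clips that survive are still the highest-ranked ones from each source.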

&lt;p&gt;For the &lt;code&gt;createQueries&lt;/code&gt; prompt, I made some improvements by obsessing over the in-context example I created from my own Twitter data. But I realised that occasional strange queries would always sneak in and cause a cascading failure of poor recommendations further down the pipeline. One generalisation I've arrived at is that a long LLM program is like a game of Chinese whispers - if you don't build error correction and recovery into the system, your output gets stranger and stranger as errors compound.&lt;/p&gt;

&lt;p&gt;I controlled for this by implementing a &lt;code&gt;filterSearchResults&lt;/code&gt; prompt which compares video search results returned from the YouTube API against the user's tweets and filters out the ones which are unrelated. Importantly, I used the user's tweets to compare against the search results, rather than comparing against the queries or a summary of the user's tweets. This guards against compounding errors from earlier in the pipeline, because the LLM may have generated strange queries or misinterpreted something in its summary of the user's tweets. It's better to compare against the "ground truth" for the user's interests, which is the tweets themselves.&lt;/p&gt;

&lt;p&gt;In my article on "Avoiding Cascading Failure in LLM Pipelines" I analysed the cascading failure problem in more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Look at Your Data
&lt;/h3&gt;

&lt;p&gt;A week ago I was running the pipeline over my Twitter data and I realised that I was consistently getting strange recommendations that made no sense. Looking at my Twitter likes and tweets I couldn't understand why certain videos had been recommended to me. Why was I getting wrestling video recommendations when I have never tweeted about anything to do with wrestling?&lt;/p&gt;

&lt;p&gt;It wasn't until I inspected the raw data getting fed into the LLM requests using &lt;a href="https://openpipe.ai/"&gt;OpenPipe's&lt;/a&gt; request log web UI that I noticed that there were Tweets included in my Twitter data that I did not recognise. I ran the &lt;code&gt;getTweets&lt;/code&gt; function a bunch more times and realised that the unofficial Twitter API I'm using to fetch tweets was returning advertisement tweets interleaved within my own tweets!&lt;/p&gt;

&lt;p&gt;I caught another bug in the &lt;code&gt;appraiseTranscripts&lt;/code&gt; prompt. I noticed that upon re-running the prompt many times over the same video, it would output the correct response only 50% of the time. Using my testing setup I was able to quickly debug the issue. I found that the prompt performed fine with 250, 500 and 1000 tokens of transcript context, but frequently failed with specifically 350 tokens of context! The transcript was from a video called "The 10 AI Innovations Expected to Revolutionize 2024 - 2025". The correct output would be to classify it as spam.&lt;/p&gt;

&lt;p&gt;Here was GPT's reasoning for recommending it in the 350 token context test: "The video uses some buzzwords and makes some broad claims about the future of AI, but it also provides specific examples and details about current developments in the field, such as self-driving cars and drone delivery services."&lt;/p&gt;

&lt;p&gt;My explanation is that with fewer tokens of context, GPT can't appraise the quality of the transcript well, because one interesting nugget can skew the assessment of quality a lot. So even if 50% of the 350 tokens are buzzwords, a couple of quality sentences can "persuade" GPT to recommend the video. The funny part is that the 250 token context test passes every time because it excludes a mildly interesting example about self-driving cars! In conclusion, to get a stable, accurate assessment of average quality you need to pass a larger number of tokens (quite obvious in hindsight).&lt;/p&gt;

&lt;p&gt;There are tons of other bugs I caught too, like inconsistent in-context example formatting, incorrect function names in function call examples and prompt variables that weren't getting replaced.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;Now that the prompt engineering is done, it's time to start curating a dataset for fine-tuning using OpenPipe. This will bring down the cost of running the pipeline and allow me to scale to more users. If you want to try out the recommendations and suggest how they could be improved, please DM me on Twitter &lt;a class="mentioned-user" href="https://dev.to/experilearning"&gt;@experilearning&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>openai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Managing Long-Running LLM Data Processing Pipelines</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Wed, 06 Dec 2023 06:56:28 +0000</pubDate>
      <link>https://dev.to/experilearning/managing-long-running-llm-data-processing-pipelines-48f9</link>
      <guid>https://dev.to/experilearning/managing-long-running-llm-data-processing-pipelines-48f9</guid>
      <description>&lt;p&gt;Looking for a simple abstraction to help you run and debug your LLM data processing pipeline without losing your mind? Look no further!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1732293110988734681-268" src="https://platform.twitter.com/embed/Tweet.html?id=1732293110988734681"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;While working on &lt;a href="https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg"&gt;Open Recommender&lt;/a&gt;, my open source, LLM-powered recommendation system for YouTube videos, I quickly ran into issues due to the incredibly slow development loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---I3STA7p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btodeeyns1lqi5pzdwxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---I3STA7p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btodeeyns1lqi5pzdwxy.png" alt="Data Pipeline Image" width="800" height="1058"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The data pipeline so far. &lt;a href="https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg"&gt;Read more about the project here&lt;/a&gt; or &lt;a href="https://www.youtube.com/watch?v=KbBwhuVpqC0"&gt;watch my video&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A full run of the pipeline takes &lt;strong&gt;up to 40 minutes&lt;/strong&gt; to complete for a single user due to the large number of GPT-4 calls involved. When I encountered crashes and bugs I would have to re-run the entire pipeline from scratch without any guarantee of reproducing the error due to the non-determinism inherent in LLM applications. As you can imagine, this became incredibly tiring and annoying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ww_pfiL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nfhx0b8uli2l9aepzmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ww_pfiL4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nfhx0b8uli2l9aepzmw.png" alt="Me when it crashes" width="680" height="1020"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;me.png&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The solution I came up with was to add the individual parts of the data processing chain to a &lt;code&gt;Pipeline&lt;/code&gt; class and use it to automatically save the inputs and outputs of each stage while the pipeline is running. This all gets saved to disk in &lt;code&gt;run-id.json&lt;/code&gt; files. If a pipeline crashes, I can look at the error, add debugger statements or logging, and restore the failed run from a checkpoint, allowing me to immediately debug and fix the failure without re-running the entire pipeline from scratch.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it Works
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validateArgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getTweets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;createQueries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;searchForVideos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filterSearchResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;downloadTranscripts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;appraiseTranscripts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunkTranscripts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recommendations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;em&gt;simplified from &lt;a href="https://github.com/bjsi/open-recommender/blob/main/src/pipeline/main.ts"&gt;src/pipeline/main.ts&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each stage in the pipeline is just a function with a name and description that takes the previous stage's output as input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getTweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get-tweets&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Get tweets from Twitter user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GetTweetsStageArgs&lt;/span&gt;
  &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Success&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CreateQueriesArgs&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;Failure&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;twitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;n_tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No tweets found&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tweets&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The &lt;code&gt;getTweets&lt;/code&gt; stage. Simplified from &lt;a href="https://github.com/bjsi/open-recommender/blob/1dc1e9b4afe58afe4ea5853fba9754afa1605066/src/pipeline/stages.ts#L40C1-L65C3"&gt;src/pipeline/stages&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since errors can occur at each stage in the pipeline, each &lt;code&gt;PipelineFunction&lt;/code&gt; is modelled as a function which can either succeed with a value of type &lt;code&gt;T&lt;/code&gt; or fail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Success&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Failure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;PipelineFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;U&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Success&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;U&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;Failure&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;PipelineStage&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;U&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PipelineFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;U&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Simplified from &lt;a href="https://github.com/bjsi/open-recommender/blob/main/src/pipeline/stages.ts"&gt;src/pipeline/stages&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I execute the pipeline, the &lt;code&gt;Pipeline&lt;/code&gt; class iterates over each of the stages, executing them in turn and passing the result of the prior stage as input to the next. All intermediate results are saved into a &lt;code&gt;run-id.json&lt;/code&gt; file so checkpoints can be restored later if required.&lt;br&gt;
&lt;/p&gt;
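&lt;p&gt;A minimal sketch of what this execution loop might look like. This is my own simplification for illustration, not the real implementation: the &lt;code&gt;Stage&lt;/code&gt; type here is an untyped version of &lt;code&gt;PipelineStage&lt;/code&gt;, and the checkpointing body is stubbed out.&lt;/p&gt;

```typescript
// Hypothetical simplification of the Pipeline execution loop: each
// stage receives the previous stage's result, and every result is
// checkpointed before moving on.
type Success<T> = { success: true; result: T };
type Failure = { success: false; result: any };

type Stage = {
  name: string;
  description: string;
  run: (input: any) => Promise<Success<any> | Failure>;
};

class Pipeline {
  private stages: Stage[] = [];
  constructor(private initialValue: any) {}

  addStage(stage: Stage) {
    this.stages.push(stage);
    return this; // enable chaining
  }

  async execute(): Promise<Success<any> | Failure> {
    let input = this.initialValue;
    for (const stage of this.stages) {
      const result = await stage.run(input);
      this.saveStage(stage, result); // checkpoint each result
      if (!result.success) return result; // stop at the failed stage
      input = result.result; // feed into the next stage
    }
    return { success: true, result: input };
  }

  private saveStage(stage: Stage, result: Success<any> | Failure) {
    // In the real pipeline this appends to a run-id.json file on disk.
  }
}
```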

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;saveStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PipelineStage&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Success&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nx"&gt;Failure&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getRunById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;initialValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;saveRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when I encounter unexpected errors, I can add any debugger and logging statements required to understand the issue and re-run the pipeline from the beginning of the failed stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yarn main &lt;span class="nt"&gt;--cloneRunID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;run-id&amp;gt;"&lt;/span&gt; &lt;span class="nt"&gt;--stage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;name&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has improved the developer experience 10,000x compared to how it was before!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1731913928630849978-520" src="https://platform.twitter.com/embed/Tweet.html?id=1731913928630849978"&gt;
&lt;/iframe&gt;

 &lt;/p&gt;

&lt;h2&gt;
  
  
  Improvements
&lt;/h2&gt;

&lt;p&gt;Here are some improvements I'll consider making to the pipeline in the future, especially once I get it running in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create different base classes for different kinds of errors. Eg. &lt;code&gt;RetryableError&lt;/code&gt; for errors that the pipeline can recover from.&lt;/li&gt;
&lt;li&gt;It's common for parts of a stage to be able to proceed without waiting for the rest of the stage to complete. Forcing stages to run sequentially means the pipeline runs slower than it could.

&lt;ul&gt;
&lt;li&gt;Maybe it's possible to support more concurrency but still have checkpoints to save the results of a complete stage.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;I should probably break up some of the stages a little bit more, eg. the &lt;code&gt;chunkTranscripts&lt;/code&gt; stage is remarkably slow and if it crashes in the middle it can still take 10 mins to re-run.&lt;/li&gt;
&lt;/ul&gt;
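&lt;p&gt;The error-class idea from the first bullet could be sketched like this. Everything here beyond the &lt;code&gt;RetryableError&lt;/code&gt; name is hypothetical: &lt;code&gt;runWithRetry&lt;/code&gt; and the other class names are invented for illustration.&lt;/p&gt;

```typescript
// Hypothetical sketch: distinguish errors the pipeline can retry from
// ones that should abort the run immediately.
class PipelineError extends Error {}
class RetryableError extends PipelineError {} // e.g. rate limits, timeouts
class FatalError extends PipelineError {} // e.g. invalid arguments

async function runWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Anything that isn't explicitly retryable aborts straight away.
      if (!(err instanceof RetryableError)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```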

</description>
      <category>ai</category>
      <category>openai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building an LLM-Powered Open Source Recommendation System for YouTube</title>
      <dc:creator>Jamesb</dc:creator>
      <pubDate>Tue, 05 Dec 2023 19:09:29 +0000</pubDate>
      <link>https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg</link>
      <guid>https://dev.to/experilearning/building-an-llm-powered-open-source-recommendation-system-40fg</guid>
      <description>&lt;p&gt;My main project for the past month or so has been &lt;a href="https://github.com/bjsi/open-recommender" rel="noopener noreferrer"&gt;Open Recommender&lt;/a&gt;, an open source LLM-powered recommendation system for YouTube videos. It works by taking your Twitter data (tweets, likes, retweets and quotes) and analyses it to infer what topics you are currently interested in. It then searches YouTube to find relevant videos and narrows them down to the clips that are most likely to interest you.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KbBwhuVpqC0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I have been curious about the idea of open recommendation systems for years. I've always found it unnerving how much control third parties have over the content I see. And I'm frustrated by the misalignment between the objective function of most platforms' recommendation algorithms and my personal reason for using the platform - platforms want to keep me scrolling to sell my attention to advertisers. But I want my recommendation system to be built with the purpose of improving my life and helping me make progress towards my goals.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When we go on our Facebook feed or that of any other social media site, we are at their recommendation algorithms mercy. They presumably optimize for clicks, time spent, and endless scrolling. That’s what they want us to do, but is that what we want out of Facebook? - &lt;a href="https://erik.bjareholt.com/wiki/importance-of-open-recommendation-systems/" rel="noopener noreferrer"&gt;&lt;em&gt;The Importance of Open Recommender Systems&lt;/em&gt; - Erik Bjäreholt&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I dreamt of back in university was a system that could infer my interests from my daily activities, like flashcard reviews, reading behaviour and browsing habits, and use that information to recommend videos and podcasts from YouTube that I could watch in the evening after school. I never got anything off the ground until a couple of weeks ago when I revisited Erik Bjäreholt's &lt;a href="https://erik.bjareholt.com/wiki/importance-of-open-recommendation-systems/" rel="noopener noreferrer"&gt;blog post on Open Recommender Systems&lt;/a&gt; and realised that LLMs have made sophisticated, customisable and explainable recommendation systems easier than ever to build!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1726290619428131208-109" src="https://platform.twitter.com/embed/Tweet.html?id=1726290619428131208"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;On top of that, massive price decreases and the increased performance of cheap, fine-tunable open source models have made this economically viable too. I reached out to a company called &lt;a href="https://openpipe.ai/" rel="noopener noreferrer"&gt;OpenPipe&lt;/a&gt; who specialise in helping companies incrementally replace expensive OpenAI GPT-4 prompts with faster, cheaper fine-tuned models and they were kind enough to sponsor all of the OpenAI calls and fine-tuning costs for this project! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ad2l2z7rf0ippqbiia3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ad2l2z7rf0ippqbiia3.png" alt="OpenPipe image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They have a super simple drop-in replacement for OpenAI's library which records your requests into a simple web interface to help you curate a dataset and fine-tune a model. I am extremely grateful for their support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Recommender
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Goals
&lt;/h3&gt;

&lt;p&gt;Here are some of the goals of Open Recommender.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understand the user's interests
&lt;/h4&gt;

&lt;p&gt;We will use users' public Twitter feeds as a proxy for their current interests. This is ethical because it relies purely on public information and it's an effective data source because people interact with tweets that are related in some way to their current interests. Of course, not everyone has an active Twitter account, but it's a good place to start. For the time being you can consider Open Recommender to be "the recommendation system for the terminally online".&lt;/p&gt;

&lt;h4&gt;
  
  
  Customizable and Explainable
&lt;/h4&gt;

&lt;p&gt;No more black box mystery algorithms - LLMs are the perfect replacement! Users can provide custom instructions using natural language and the LLM can provide understandable explanations for its recommendations. Eg. it can explain which Twitter posts influenced its decision to recommend you a certain podcast or interview.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommend interesting clips from videos
&lt;/h4&gt;

&lt;p&gt;I want to experiment with recommending smaller units of content, similar to YouTube shorts. There's so much great information buried in 4 hour long podcasts that I don't have time to watch. I want the recommendation system to show me the specific clip I'll find most interesting. Then I can decide whether to continue watching the whole thing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommend "timeless" content
&lt;/h4&gt;

&lt;p&gt;It's not always the case that newer videos are better. Current recommendation algorithms are biased towards trends and virality. In addition to the latest and greatest I also want to be able to recommend old videos which have stood the test of time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Biased towards learning as opposed to entertainment
&lt;/h4&gt;

&lt;p&gt;I love the user interface of YouTube shorts. It reminds me a lot of &lt;a href="https://www.youtube.com/watch?v=oNCLLNZEtz0" rel="noopener noreferrer"&gt;incremental reading&lt;/a&gt; which I'm a huge fan of. I just wish the content wasn't so sensationalist, clickbaity and trashy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Current State
&lt;/h3&gt;

&lt;p&gt;I've already finished the MVP of the data processing pipeline. Here's a diagram, or you can take a scroll through &lt;a href="https://github.com/bjsi/open-recommender/blob/main/src/pipeline/main.ts" rel="noopener noreferrer"&gt;src/pipeline/main.ts&lt;/a&gt;. It's actually quite a simple set of steps to go from Twitter data to YouTube video recommendations! &lt;/p&gt;

&lt;p&gt;Since Open Recommender is open source, you can even run the current version yourself right now by following the &lt;a href="https://github.com/bjsi/open-recommender/blob/main/README.md#installation" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt;, but be warned - it can get expensive!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef7muhtedi89ar8gxlfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef7muhtedi89ar8gxlfi.png" alt="Data pipeline image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next steps for the project are to continue iterating on the GPT-4 prompts to improve the quality of the recommendations. So far for each pipeline run, roughly half of the recommendations are good and half are kinda meh. The goal is to improve this ratio to the point where 80% of the recommended videos are good. At that point I will transition to fine-tuning to bring down the cost. Then we can start getting some users!&lt;/p&gt;

&lt;p&gt;To quote Kyle, one of the founders of OpenPipe:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1724502554762187198-935" src="https://platform.twitter.com/embed/Tweet.html?id=1724502554762187198"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;In the next few weeks I'll also be writing more articles with code snippets and technical details about the lessons I learned building the data processing pipeline for Open Recommender, especially regarding prompt engineering and iteration. Looking forward to any ideas you have and can't wait to share the progress with you!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
