<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[2point0.ai]]></title><description><![CDATA[A blog for developers navigating the AI landscape, featuring hands-on tutorials, app reviews, and insightful exploration of the tech.]]></description><link>https://2point0.ai/</link><image><url>https://2point0.ai/icon-512.png</url><title>2point0.ai</title><link>https://2point0.ai/</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 12 Feb 2025 15:43:36 GMT</lastBuildDate><atom:link href="https://2point0.ai/feed.xml" rel="self" type="application/rss+xml"/><item><title><![CDATA[Vibe coding]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/vibe-coding.CJ6jqZwg_Z1PKudG.webp" srcset="https://2point0.ai/_astro/vibe-coding.CJ6jqZwg_Z1PKudG.webp 1x,https://2point0.ai/_astro/vibe-coding.CJ6jqZwg_Z1PKudG.webp 2x" alt="Vibe coding" /><figcaption>Vibe coding</figcaption></figure><p>I like to think of myself as an “old-skool” dev - though let’s be honest, that’s just a hip way of saying I’m an old dev that’s been writing code since another millennium.</p>
<p>It wasn’t that long ago that professionals working on the web would describe their work using the language of a “craftsman”. We were digital artisans, building pixel perfect designs, creating delightful digital experiences and shipping hand-crafted code that we were proud of.</p>
<p>These days developers have a new language that captures the zeitgeist of the modern AI engineer: <a href="https://x.com/karpathy/status/1886192184808149383">Vibe coding</a>.</p>
<figure>
    <img src="/images/2025/karpathy-vibe-coding.webp" alt="@karpathy on Vibe Coding">
    <figcaption class="sr-only">
        <blockquote>
            <p>There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.</p>
            <footer>
                Posted by <a href="https://x.com/karpathy/status/1886192184808149383">@karpathy</a>
            </footer>
        </blockquote>
    </figcaption>
</figure>
<p>Karpathy goes on to describe vibe coding in practice as some kind of slider.</p>
<div data-callout="" data-callout-type="quote">
<div data-callout-title=""><a href="https://x.com/karpathy/status/1886193527224517106">@karpathy</a></div>
<div data-callout-body="">
<p>All the way on the left you have programming as it existed ~3 years ago. All the way on the right you have vibe coding.</p>
</div>
</div>
<p>I like the slider analogy and see this as a transition from craft to vibes. It’s the modern day equivalent of when textile work shifted from a highly skilled craft to an entirely automated process. The transition didn’t occur overnight, but one <a href="https://en.wikipedia.org/wiki/Spinning_jenny">Spinning Jenny</a> here and a <a href="https://en.wikipedia.org/wiki/Jacquard_machine">Jacquard Loom</a> there, and the next thing you know the craft of textile working is no more. The march of progress is inexorable, and we’ve seen that transition repeat many times through history.</p>
<p>But where exactly are we on this slider? And if I can stretch the analogy even further, if our slider is a DJ’s cross-fader, are we in for a quick chop, or is this going to be a long fade mix?</p>
<p><img src="https://2point0.ai/images/2025/cross-fader.jpg" alt="Vibe slider"></p>
<h2 id="current-state-of-play">Current state of play</h2>
<p>I recently spent a few days with a company that are cranking that vibe slider as far as they can. They’re building their own agent tooling that will take a boilerplate app and spec, and use AI to create a fully functioning app in 15 to 30 minutes.</p>
<p>It’s really impressive to watch an agent loop over the spec and iterate on the build step by step. What they’ve built so far is great, and one of their generated apps has been out in the wild for a while and is already generating revenue.</p>
<p>But… their generated apps are still what you might consider (and I’m quoting their CEO here), <em>“shitty little apps”</em>. We’re talking single feature apps - a small database, user auth, a handful of routes and a couple of endpoints. For a competent dev, it’s the kind of thing you could bash out in a day or two.</p>
<p>I can pick other bones too. When the agents struggle with a more complex feature, what they’re typically doing is hand-writing the code and wrapping it in a function, so the AI-generated code can implement the feature just by calling that function. That’s a totally pragmatic thing to do, but we’re already edging that slider back towards craft coding.</p>
<p>I think what this company is doing is really cool, and their ambition is far greater than shitty little apps. But I mention all this because I think it reflects the messy reality of vibe coding. Anyone who’s used Cursor for anything more complex than a one-shot Flappy Bird clone will know there’s a lot of effort involved in prompting, providing context, and testing and validating outputs. Without a skilled human pulling the strings, the vibes can get ugly, quickly.</p>
<h2 id="prompting-and-context-challenges">Prompting and context challenges</h2>
<p>Prompting isn’t just about describing a task or instruction. Agents need to know everything about your codebase in quite a lot of detail. They need to know the stack, the key dependencies, and they need to see lots of examples of how you actually want them to code.</p>
<p>And as a project and its architecture evolve, so too must the prompts and fragments of prompts that your agents depend on. Managing all of this is quite a lot of effort. Being able to prompt clearly and effectively is a skill that not every developer is going to be blessed with. But for vibe coding, it’s essential.</p>
<p>There are <a href="https://cursorlist.com/">sites popping up</a> for users to share <code>.cursorrules</code> files for different stacks, and some developer tools and libraries are starting to share <a href="https://supabase.com/docs/guides/getting-started/ai-prompts">prompt fragments in their documentation</a>. This is all great to see and does help the vibe coder find their way in all this.</p>
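<p>For a sense of what those shared files contain: a <code>.cursorrules</code> file is just plain-text guidance that gets prepended to the model’s context. A fragment for a hypothetical TypeScript stack might look something like this (every rule here is illustrative, not taken from any real project):</p>

```
You are an expert in TypeScript, React and Supabase.

- Use functional components and hooks; avoid class components.
- Prefer named exports and explicit return types.
- All database access goes through the shared Supabase client.
- Match the existing test style: one describe block per module.
```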
<p>Context management poses another challenge. In principle, agents work better when they are shown the right context at the right time, and not necessarily all the context all the time. Ideally, IDEs would make this <em>just work™</em>, but in practice this is tricky stuff to get right and another messy aspect of vibe coding that developers need to grapple with.</p>
<h2 id="validation-challenges">Validation challenges</h2>
<p>When we’re writing code ourselves, we inherently validate our work as we go. But with vibe coding, validation becomes a critical challenge that can make or break the entire approach. We can either fly blind, or we can implement robust validation strategies and take advantage of <a href="https://x.com/jonas/status/1879573296913756389">mathematics and probability</a>.</p>
<figure>
    <img src="/images/2025/jonas-probability.webp" alt="@jonas on Probability">
    <figcaption class="sr-only">
        <blockquote>
            <p>If an AI agent gets a task (say, building an app) right only 1/10 times, it means that with enough money it can get it right 99.99% of the time.</p>
            <p>For that to work, though, the agent needs good validators that tell it whether it did the right thing.</p>
            <footer>
                Posted by <a href="https://x.com/jonas/status/1879573296913756389">@jonas</a>
            </footer>
        </blockquote>
    </figcaption>
</figure>
<p>The idea here is that if an agent succeeds only 10% of the time, running enough independent attempts still gets you to near-certainty: the chance that at least one of <em>n</em> runs succeeds is 1 - 0.9<sup><em>n</em></sup>, which crosses 99.99% at 88 runs.</p>
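<p>That arithmetic is easy to sanity-check; here’s a quick sketch in TypeScript (the function names are just illustrative):</p>

```typescript
// Chance that at least one of n independent runs succeeds,
// given a per-run success probability p.
function atLeastOneSuccess(p: number, n: number): number {
  return 1 - (1 - p) ** n;
}

// Smallest number of runs needed to reach a target confidence.
function runsNeeded(p: number, target: number): number {
  return Math.ceil(Math.log(1 - target) / Math.log(1 - p));
}

console.log(runsNeeded(0.1, 0.9999));    // 88
console.log(atLeastOneSuccess(0.1, 88)); // ≈ 0.9999
```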
<p>All that depends on having good tests and checks in place to determine when an agent gets it right. There’s some low-hanging fruit here: your agents should definitely be seeing your TypeScript compile errors and linter warnings for a start.</p>
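<p>As a sketch of that low-hanging fruit: a minimal validator can just shell out to the compiler and linter and hand any output back to the agent. This assumes a typical Node/TypeScript project, and the exact commands (<code>npx tsc --noEmit</code>, <code>npx eslint .</code>) are illustrative:</p>

```typescript
import { execSync } from 'node:child_process';

// Run a check command; return its combined output on failure,
// or an empty string when the command exits cleanly.
function runCheck(cmd: string): string {
  try {
    execSync(cmd, { stdio: 'pipe' });
    return '';
  } catch (err) {
    const e = err as { stdout?: Buffer; stderr?: Buffer };
    return `${e.stdout ?? ''}${e.stderr ?? ''}`;
  }
}

// Gather compiler and linter feedback in one string; anything
// non-empty gets fed straight back into the agent's next prompt.
function collectFeedback(): string {
  return [
    runCheck('npx tsc --noEmit'),
    runCheck('npx eslint .'),
  ].filter(Boolean).join('\n');
}
```

An empty string from <code>collectFeedback()</code> is the “pass” signal; anything else becomes context for the agent’s next attempt.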
<p>And then there’s testing. In a vibe coding maximalist world the AI will write the tests too, but for now this is a role for developers whilst they push the vibe slider as far as they’re comfortable with.</p>
<h2 id="ui-and-ux-challenges">UI and UX challenges</h2>
<p>Moving the vibe coding slider to the right comes at a cost, and by far the greatest cost, in my view, is the detrimental effect on the quality of design and user experience.</p>
<p>Whilst agents can do a reasonable job throwing components together using component libraries like shadcn/ui and similar, let’s be honest, AI-gen design is always very, very average. We’re talking functional but uninspired layouts, predictable typography and colour schemes that you know you’ve seen a thousand times over. It’s bland, generic, just “meh” kinda stuff.</p>
<p>Sometimes that’s fine. If it’s a private or internal project, we can make these kinds of compromises. But for a real, consumer facing, commercial app, I think the vibe coder falls well short of any craft coder with good design chops.</p>
<p>There’s a missing piece here: not just models that have visual understanding, but models that are specifically trained on decades of examples of great design and can review and improve their own code from a UX point of view.</p>
<p>I’m sure somewhere there are some smart people working on exactly that, but for now I think vibe coders just fundamentally don’t have the tools for doing good design and UX.</p>
<h2 id="the-times-they-are-a-changin">The times, they are a-changin’</h2>
<p>The software engineer, as a role, is well and truly in the mix. You can hear the beat - DJ Vibes is moving that slider from left to right. But I don’t think we’re quite as far into the transition as <a href="https://x.com/shl/status/1887484068075274347">others seem to think</a>.</p>
<figure>
    <img src="/images/2025/sahil-junior-devs.webp" alt="@shl on Junior Devs">
    <figcaption class="sr-only">
        <blockquote>
            <p>No longer hiring junior or even mid-level software engineers.</p>
            <p>Our tokens per codebase:</p>
            <p>
                Gumroad: 2M<br>
                Flexile: 800K<br>
                Helper: 500K<br>
                Iffy: 200K<br>
                Shortest: 100K
            </p>
            <p>Both Claude 3.5 Sonnet and o3-mini have context windows of 200K tokens, meaning they can now write 100% of our Iffy and Shortest code if prompted well.</p>
            <p>Our new process:</p>
            <p>
                1. Sit and chat about what we need to build, doing research with Deep Research as we go.<br>
                2. Have AI record everything and turn it into a spec.<br>
                3. Clean up the spec, adding any design requirements / other nuances.<br>
                4. Have Devin code it up.<br>
                5. QA, merge, (auto-)deploy to prod.
            </p>
            <footer>
                Posted by <a href="https://x.com/shl/status/1887484068075274347">@shl</a>
            </footer>
        </blockquote>
    </figcaption>
</figure>
<p>Sahil’s post doesn’t really pass the sniff test. Notwithstanding the obvious fact that if you don’t have junior devs, then it isn’t long until you don’t have senior devs either, I simply don’t think the tooling and models are ready for this. But as a vision for what a maxed out vibe coding environment looks like… maybe?</p>
<p>I still <a href="https://2point0.ai/posts/reports-of-codings-death-are-greatly-exaggerated">don’t believe coding is dead</a>, and I don’t believe junior or mid level devs are done for, but I do believe this slide to the right is in play. The role of a developer or software engineer, across all levels, is fundamentally changing. Whilst we can dream of sitting around on beanbags, chatting, and brainstorming while the agents do the coding, the reality is we’re trading writing code for wrestling with prompts, managing context, and writing many, many tests and validators.</p>
<h2 id="conclusions-any">Conclusions… any?</h2>
<p>The slide from craft to vibes isn’t a clean transition - it’s messy, complex, and full of hard problems that need to be solved - and we’re still very much in the early days of this transition.</p>
<p>The reality is that effective vibe coding today requires a peculiar mix of skills: traditional coding expertise, prompt engineering finesse, and a willingness to test AI outputs far more rigorously than you’d test your own work. It’s less about replacing development skills and more about augmenting them with new tools and approaches.</p>
<p>Just as the industrial revolution was a more gradual transition than the age’s technological advancements suggested - a transition limited by the practical realities of human adaptation - the shift to vibe coding will likely be just as nuanced. The greatest opportunities will come to those who can navigate this gap between AI capability and human readiness.</p>
<p>For now, the slider sits closer to craft than vibes. The future may yet bring Sahil’s vision of developers as spec-writers and AI-wranglers into reality, but the path there isn’t as straight or as short as some might suggest. The beat goes on, the mix continues, and somewhere between craft and vibes, we’ll find our groove.</p>]]></description><link>https://2point0.ai/posts/vibe-coding</link><guid isPermaLink="false">d3d3a82e-8492-5af9-84c2-40104fcbd104</guid><pubDate>Wed, 12 Feb 2025 00:00:00 GMT</pubDate></item><item><title><![CDATA[Announcing Agentflow]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/agentflow-banner.CQOrsD8e_MchNv.webp" srcset="https://2point0.ai/_astro/agentflow-banner.CQOrsD8e_MchNv.webp 1x,https://2point0.ai/_astro/agentflow-banner.CQOrsD8e_MchNv.webp 2x" alt="Announcing Agentflow" /><figcaption>Announcing Agentflow</figcaption></figure><p>I’m very pleased to announce the launch of <a href="https://agentflow.2point0.ai">Agentflow</a> - a powerfully simple AI agent framework.</p>
<p>Agentflow enables developers to create AI workflows that read like documentation but execute like code. By combining natural language with minimal markup, you can compose complex agent behaviours using plain English as your main programming language.</p>
<p>The framework provides all the structure and control you’d expect from traditional programming but is expressed in an intuitive Markdown-based format that puts readability first.</p>
<h2 id="not-another-one-why"><em>Not another one</em>. Why?</h2>
<p><img src="https://2point0.ai/images/2024/not-another-one.webp" alt="Not another one!"></p>
<p>The path to Agentflow began with a simple frustration. I am not a Python developer. Frameworks like CrewAI and LangChain dominate the AI agent space, but I found myself wanting similar tooling using languages I was more comfortable with and that better suited my development style.</p>
<p>I’ve also noticed a trend that many agent frameworks are quite complex and enterprise-focused. I didn’t want to set up servers and infrastructure; I just wanted something lightweight and easy that I could hack around with on my own computer.</p>
<p>Not too many moons ago I shared <a href="https://2point0.ai/posts/meet-the-news-shift">my work on Shifts</a>, an Elixir-based agent framework inspired by CrewAI. I definitely learnt some lessons building Shifts, but it didn’t feel like I’d nailed it. So, I put it down to think about the problem again.</p>
<p>There’s an old <a href="https://x.com/karpathy/status/1617979122625712128">tweet from Andrej Karpathy</a> that simply says, “The hottest new programming language is English”. I love that idea, and it got me thinking… What if we could create a framework that embraces that idea, and puts the human intent front and centre? Rather than bringing prompts into code, could we somehow make the prompts the code? This thinking led directly to what would become Agentflow.</p>
<h2 id="what-makes-agentflow-different">What makes Agentflow different?</h2>
<p>From the start, Agentflow embraces natural language as its primary input. Instead of constructing agents with code that contains prompts, with Agentflow you write prompts that contain minimal code. This makes the composition of agents simpler, but more importantly, it puts human intent front and centre of the development process.</p>
<p>Workflows are written using a variation of Markdown similar to <a href="https://mdxjs.com/docs/what-is-mdx">MDX</a> (in fact, it uses the MDX parser internally). The result is a format that is as readable as plain text and with enough structure to implement loops, conditional branching and call actions inside the workflow.</p>
<p><img src="https://2point0.ai/images/2024/agentflow-example.webp" alt="Example workflow in Agentflow"></p>
<p>Staying true to this text-first philosophy, Agentflow operates entirely through a command-line interface - no GUI required. It’s lightweight and easy to install, with no infrastructure or cloud requirements. You can <a href="https://agentflow.2point0.ai/guide/getting-started.html">get everything up and running on your machine</a> in under a minute.</p>
<p>Under the hood, Agentflow is powered by Vercel’s excellent <a href="https://sdk.vercel.ai">AI SDK</a>. This means baked-in support for any AI provider supported by the SDK (that’s pretty much all of them), and agents can use any third-party tools compatible with AI SDK. The result is a framework that is as capable as it is approachable, whether building simple automations or sophisticated AI-powered workflows.</p>
<h2 id="looking-ahead">Looking ahead</h2>
<p>Agentflow is still in its early stages and there’s a lot of work ahead. Like any new software, there will be bugs to fix, rough edges to smooth out, and for now the design and APIs are all subject to change as I learn how people use Agentflow out in the wild.</p>
<p>And that’s exactly why I’m releasing Agentflow now. I want to see how people use it, learn from their experiences, and understand what they need. All feedback is valuable at this stage, whether it’s bug reports or feature requests. While there are clear areas for improvement, such as expanding the collection of first-party tools, or thinking carefully about security when users start sharing workflows and actions, the exact direction will be influenced by users’ needs and experiences.</p>
<p>My hope for Agentflow is simple: I want to make it easier for anyone to start experimenting with and building AI agents. Whether you’re an experienced developer or a weekend-only hacker and tinkerer, Agentflow’s JavaScript foundations and natural language programming should feel intuitive and accessible. I’m excited to see what people build with it, how they extend it, and where they take it.</p>
<h2 id="get-building">Get building</h2>
<p>Ready to try Agentflow? Head over to the <a href="https://agentflow.2point0.ai">Agentflow docs</a> - the <a href="https://agentflow.2point0.ai/guide/getting-started.html">getting started guide</a> will walk you through installation and creating and running your first workflow.</p>
<p>For those interested in reporting bugs, getting involved, or just reading the source code, head over to the <a href="https://github.com/lebrunel/agentflow">Agentflow repository</a> on GitHub.</p>
<p>I’m looking forward to seeing what you build with it!</p>]]></description><link>https://2point0.ai/posts/announcing-agentflow</link><guid isPermaLink="false">31ed4fae-6ce9-5812-9a9a-0451150207ed</guid><pubDate>Tue, 17 Dec 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Zed AI review: Getting the balance right in AI code editing]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/zed-ai-review.BJNignXu_Z1PN6Sb.webp" srcset="https://2point0.ai/_astro/zed-ai-review.BJNignXu_Z1PN6Sb.webp 1x,https://2point0.ai/_astro/zed-ai-review.BJNignXu_1GBGUK.webp 2x" alt="Zed AI Review" /><figcaption>Zed AI Review</figcaption></figure><p>Developers are unwaveringly loyal when it comes to their choice of code editor. Just ask anyone who’s spent the time and energy needed to master vim… that’s a path they will never turn back from.</p>
<p>For developers, switching code editor is a rare - almost epochal - event. My own journey from Dreamweaver (yep!) to Sublime Text to VS Code is one that played out over 20 years. But recently, with AI becoming an essential part of many developers’ toolkits, a new generation of code editors that place AI front and centre of the developer experience is emerging to shake up the status quo.</p>
<p>Whilst Cursor has been grabbing <a href="https://www.cursor.com/blog/series-a">all the headlines</a>, it is far from the only player in town. Last month, <a href="https://zed.dev/blog/zed-ai">Zed announced Zed AI</a>, its own AI-assisted coding tools packaged within an already impressive code editor.</p>
<p>For the past few weeks I have been using Zed as my daily driver. This article won’t be a full review of Zed, but I will share my experiences and impressions of Zed’s AI features after a month of pretty heavy use. Is this one of those epochal moments? Will I be switching for good? Read on to find out.</p>
<h2 id="what-is-zed">What is Zed?</h2>
<p>I first came across Zed earlier this year when I noticed a few developers in the Elixir community speaking highly of it. Launched in 2023, Zed is founded by <a href="https://zed.dev/team">the team</a> who built the GitHub Atom editor. In Zed, they’ve built a fast, modern code editor. It uses its own highly performant rendering engine built with Rust, and Zed has been steadily winning fans by focussing on the developer experience and getting the basics right.</p>
<p>The introduction of AI features builds on Zed’s earlier focus on collaborative chat and pair-programming. I’m unsure if this represents a slight pivot in plan, but either way Zed AI turns Zed from an interesting-looking editor that didn’t quite have enough to peel me away from VS Code a few months ago, into an editor that I definitely want to take for a spin now.</p>
<h2 id="zed-ai-features">Zed AI features</h2>
<p>Zed’s AI offering can broadly be split into three main features:</p>
<ul>
<li>The <strong>assistant panel</strong> for chatting with AI about your code.</li>
<li>The <strong>inline assistant</strong> for using AI to directly transform your code.</li>
<li>Slash style <strong>assistant commands</strong> for injecting additional context into your chats and interactions.</li>
</ul>
<p>Let’s take a look at each:</p>
<h3 id="assistant-panel">Assistant panel</h3>
<p>At a glance, the assistant panel looks and feels like every other AI chat interface you’ve ever used. But it’s definitely not quite like <em>any</em> other chat interface. The entire chat - what Zed calls a “context” - is one continuous text file that feels the same as working with any other text file in Zed. You can move the cursor anywhere, work with multiple cursors, use key bindings, and so the process of adding bits of code to your context feels completely natural and seamless.</p>
<p><img src="https://2point0.ai/images/2024/zed-assistant-panel.webp" alt="Zed Inline Assist"></p>
<p>There is a keyboard shortcut (<code>cmd-></code>) for inserting selected text from the main editor into the assistant panel in a fenced code block. There is also a range of slash commands (which we’ll explore a bit further below) for injecting code into your context - for example an entire file or directory of files. These injected elements appear in the assistant panel as a toggle-able element that can be expanded in full or minimised, so your focus can remain on your prompt.</p>
<p>When you’re ready, hitting <code>cmd-enter</code> submits your context and the AI assistant streams a response to the bottom of the context.</p>
<p>One useful side effect of the entire context being an editable buffer of text is that it’s trivial to “rewrite history” by removing or editing previous parts of the conversation. You can fork an entire context by simply copying and pasting the entire thing into a new context and taking the same base conversation in two different directions.</p>
<p>I’m not aware of any other chat UI working quite like this, and so the Zed team deserve some applause here. They’ve nailed the UX, and I find myself wanting to do other, non-coding tasks in Zed (for example, writing drafts for this blog) purely because I love the experience of flipping to and from the assistant panel so much.</p>
<h3 id="inline-assistant">Inline assistant</h3>
<p>Wherever you can select text - the main editor, the terminal, even the assistant panel - you can invoke the inline assistant and prompt it to directly transform your selected code.</p>
<p>When you invoke the inline assistant (<code>ctrl-enter</code>), if you also have the assistant panel open, that is provided as context in addition to your inline prompt. That’s not totally intuitive, but once you understand that’s the way it works it’s a powerful way to combine a detailed prompt as context with a simple inline instruction.</p>
<p><img src="https://2point0.ai/images/2024/zed-inline-assist.webp" alt="Zed Inline Assist"></p>
<p>The transformed code is then presented to you as a diff which you can either accept or reject. You can also use multiple cursors and highlight multiple selections to invoke inline assistant in several places at the same time. I use this all the time for adding doc blocks to functions.</p>
<p>One recipe in the Zed docs for fixing errors is to create a new context in the assistant panel, use the <code>/diagnostics</code> and <code>/terminal</code> commands to inject diagnostics and error logs into the context, select the misbehaving code and use inline assistant to generate a fix. It works wonderfully.</p>
<h3 id="assistant-commands">Assistant commands</h3>
<p>Within the assistant panel, you can hit forward slash (<code>/</code>) to bring up a list of available commands. These commands effectively inject content from your code or elsewhere in the editor into the current context.</p>
<p>For example, the <code>/diagnostics</code> command will inject any output and warnings from the language server; the <code>/file</code> command allows you to select a file or folder of files to insert into the context; and the <code>/fetch</code> command injects the response of an HTTP request. All the built-in commands are listed in the <a href="https://zed.dev/docs/assistant/commands">Zed docs</a>.</p>
<p>If the built-in commands don’t serve all your needs, you can create your own. Well, apparently you can. Documentation is a bit thin here so I haven’t tried to create a custom slash command, but in theory this could be a very powerful way to inject context from other sources, such as documentation, websites and APIs, or other local files.</p>
<p>One command worth special mention is <code>/workflow</code>. If you invoke it and expand the toggle, it reveals a giant system prompt that instructs the LLM to guide the user through a series of stepped changes and respond in a very specific, structured format. The assistant panel then recognises this structure and presents the “steps” with a UI where you can work through the changes step by step.</p>
<p><img src="https://2point0.ai/images/2024/zed-workflow.webp" alt="Zed Workflow Command"></p>
<p>You can see where they’re going with this. This brings Zed closer to Cursor’s Composer view, where you can work with the assistant iteratively to build out a specific feature or refactor purely through prompting. In my testing, I couldn’t really get <code>/workflow</code> to work reliably. It would produce a multistep workflow which had “step resolution errors”, or some steps would repeat or conflict with previous steps. It’s a promising looking feature, but in practice feels like you need to cross all your fingers and make a sacrifice to the AI gods if you want it to actually work.</p>
<h2 id="what-do-developers-want-from-ai-in-code-editors">What do developers want from AI in code editors?</h2>
<p>I know quite a few experienced software developers who are either irrationally anti-AI, or just apathetic towards AI. On the flip side, many younger developers are attracted to the idea of a prompt-driven, no-code software development process - just see all the <em>“How I used Cursor to build this [INSERT APP/GAME] clone in MINUTES”</em> videos doing the rounds on social media.</p>
<p>Needless to say, I think both these takes are missing the mark. The real gains of using AI in software development come from a more nuanced middle ground.</p>
<p>I use AI every single day for things like:</p>
<ul>
<li>Thinking through a problem, and comparing and evaluating different high-level solutions.</li>
<li>Writing enough code to “show me the way” so that I can pick up a pseudo implementation and use it in a way more specific to my code base.</li>
<li>Fixing errors by sharing the error and the code and letting AI generate the fix.</li>
<li>Chores like writing unit tests and adding code documentation and annotations.</li>
</ul>
<p>What I don’t expect or want AI to do is write every single line of code in my app or product. I can’t see how it’s possible for that to happen without being so disconnected from the process that I no longer feel I can influence, or be responsible for, the quality (or lack thereof) of the code. Even if the models get better over time, and I expect they will, I feel this is a misguided aspiration that will result in worse developers, worse code and worse products.</p>
<p>So for now, the flashy, whizz-bang Cursor demos don’t really impress me, and don’t reflect what I’m looking for in an AI coding assistant.</p>
<h2 id="zed-improves-the-dev-experience-with-balanced-ai-features">Zed improves the dev experience with balanced AI features</h2>
<p>Which brings us back to Zed. How does it really stack up?</p>
<p>First off, before we even start talking about AI, Zed is a solid, modern and very fast code editor that gets the basics right. Being a new editor, it’s not without bugs, but I didn’t come across anything other than minor UX/UI annoyances. It also lacks support for one or two more niche language syntaxes, but all popular languages are well-supported.</p>
<p>Zed’s AI features are refined and thought through. The UX of the assistant panel feels integrated and seamless. Treating the entire chat “context” as one giant text file feels so natural. And when we need to inject context, the built-in slash commands are simple, intuitive and work perfectly.</p>
<p>When you need AI to directly transform your code, the inline assistant works great. Combining the assistant panel with multiple cursors and inline assistants allows for some pretty powerful and efficient workflows.</p>
<p>One thing Zed doesn’t have is AI tab autocompletion. I know some people love it, but I’ve always felt that’s like Russian Roulette coding.</p>
<p>And this is why I say Zed’s AI features feel thought through. Has tab autocomplete been omitted because they just haven’t built it yet, or did they weigh up how much value these features add to the developer experience first? I get the feeling it’s the latter.</p>
<p>I’m sure they will add more AI over time - the <code>/workflow</code> command feels very much like a work in progress, and it will be hard to resist trying to keep up with the more AI-centric approach of Cursor over time. But I hope and trust Zed keep their focus on developers and the developer experience. At the moment, they’re striking the balance well.</p>
<p>Zed’s hosted AI features are powered by Claude and are currently available for free. I presume that won’t always be the case, but there are no details currently on pricing. However, it’s also possible to add your own API keys for whatever AI providers you fancy, including Anthropic. So if and when it becomes a paid service, you should be able to run with your own keys if preferred. This, at least to me, makes it a more attractive prospect than Cursor where the only option is a subscription.</p>
<h2 id="conclusion">Conclusion</h2>
<p>So, after a month of use, are we witnessing the beginning of a new epoch? Is VS Code now destined to gather dust in my dock whilst Zed gets all the attention? The answer is, <strong>yes</strong>!</p>
<p>I’m genuinely loving using Zed, and the way its AI features are designed and integrated feels just right for <em>me</em> and what I want from AI at this time.</p>
<p>But… here’s the catch. I’ve compared Zed to Cursor a few times in this article, and I’ve cast a fair bit of scepticism on the whole <em>“I built Notion in minutes”</em> thing… but I’m not being entirely fair here, because I’ve not tried Cursor to the extent that I’ve been using Zed this past month.</p>
<p>So, my challenge for October is to put Cursor through its paces too. We will see if it wins me over or changes my view on exactly what kind of AI features I think I need. So stay tuned, subscribe to <a href="https://2point0.ai/feed.xml">the feed</a> and <a href="https://x.com/aaronrussell">follow me on X</a> to be the first to read my Cursor review next month.</p>]]></description><link>https://2point0.ai/posts/zed-ai-review</link><guid isPermaLink="false">21bf1c96-57e4-5b72-8db4-7825ce6cc890</guid><pubDate>Fri, 20 Sep 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[California's AI safety bill has passed. What does it mean for open source AI?]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/ai-safety-bill.prUb6Hz3_1hSgbW.webp" srcset="https://2point0.ai/_astro/ai-safety-bill.prUb6Hz3_1hSgbW.webp 1x,https://2point0.ai/_astro/ai-safety-bill.prUb6Hz3_1hSgbW.webp 2x" alt="SB-1047 AI safety bill" /><figcaption>SB-1047 AI safety bill</figcaption></figure><div data-callout="" data-callout-type="info">
<div data-callout-title="">SB-1047 vetoed</div>
<div data-callout-body="">
<p></p>
<p>On 29 September 2024, one month after publishing this article, Governor Gavin Newsom vetoed SB-1047, citing the “chilling effect” the bill could have on the competitiveness of California’s AI industry.</p>
</div>
</div>
<p>It’s finally happened - California’s <a href="https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240SB1047">AI safety bill (SB-1047)</a> was <a href="https://www.theverge.com/2024/8/28/24229068/california-sb-1047-ai-safety-bill-passed-state-assembly-governor-newsom-signature">passed by the State Legislature</a>. The bill now sits on Governor Gavin Newsom’s desk and is one rubber stamp away from becoming law.</p>
<p>So, I guess that’s it? Time to pack up and move onto something else… I mean, for months the AI chattering classes have been telling us how <a href="https://time.com/collection/time100-voices/7016134/california-sb-1047-ai/">disastrous this bill will be</a>, how it will restrain innovation and kill off open source AI.</p>
<p>Well, not so fast. You see, this story is as old as the hills. Every new and disruptive technology faces similar scrutiny. The legislators have heard all the arguments on both sides a thousand times over, and their job is to strike a balance of protecting society without dampening innovation.</p>
<p>So, now that the act is a formality away from becoming law, let’s take a step back and look at what this bill <em>really</em> means for AI. In this article I’ll offer my layman’s interpretation of what the act covers (and doesn’t cover), and what the real-world implications for AI will be, particularly for open source AI.</p>
<h2 id="what-is-sb-1047">What is SB-1047</h2>
<p>To give the bill its full name, SB-1047 is the <em>“Safe and Secure Innovation for Frontier Artificial Intelligence Models Act”</em>. As the name suggests, its purpose is to ensure the safe and secure development of frontier artificial intelligence models in California.</p>
<p>The legislation is of global significance because AI is a highly disruptive emerging technology and the law is playing catch-up. Regulators around the world are watching California carefully, and this bill will likely set a precedent that other states and countries follow.</p>
<p>And, if you take the view as I do, that AI represents a new industrial age, then the US getting this wrong could hand its technological lead to its competitors, with geopolitical consequences.</p>
<p>With the stakes so high, it’s natural that those in the AI industry are anxious about premature regulation. Of course, the devil is in the details, so let’s dive in and explore some of the nuance that is often missed in discussion.</p>
<h2 id="breaking-down-the-bill">Breaking down the bill</h2>
<p>First of all, this bill sets out a couple of crucial definitions that define the scope and purpose of the bill.</p>
<h3 id="covered-models">“Covered models”</h3>
<p>The bill defines specific thresholds for what models are covered:</p>
<ul>
<li>An AI model trained using computing power greater than 10^26 floating point operations (FLOPS), with a cost exceeding $100 million.</li>
<li>OR, a model created by “fine-tuning” a covered model using computing power greater than 10^25 FLOPS, with a cost exceeding $10 million.</li>
</ul>
<p>10^26 is a number with more zeros than my brain can handle, but $100 million I can just about cope with. It’s a <strong>lot</strong> of money.</p>
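<p>To make the two-pronged test concrete, here’s a quick sketch of the thresholds as code. This is purely my own illustration - the figures come from the bill’s definitions above, and the Llama 3.1 compute number is a rough public estimate, not an official one:</p>

```python
def is_covered_model(training_flops: float, training_cost_usd: float,
                     fine_tune_of_covered: bool = False) -> bool:
    """Illustrative sketch of SB-1047's "covered model" thresholds.

    Both the compute threshold AND the cost threshold must be exceeded.
    """
    if fine_tune_of_covered:
        # Fine-tuning threshold: > 10^25 FLOPS and > $10 million
        return training_flops > 1e25 and training_cost_usd > 10_000_000
    # Training threshold: > 10^26 FLOPS and > $100 million
    return training_flops > 1e26 and training_cost_usd > 100_000_000

# Llama 3.1 405B reportedly used on the order of 4e25 FLOPS of training
# compute, which falls short of the 1e26 bar even at a $100M+ training cost.
print(is_covered_model(4e25, 150_000_000))  # False
```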
<h3 id="critical-harm">“Critical harm”</h3>
<p>The bill provides a specific definition of “critical harm”: any of the following harms “materially” enabled by a covered model.</p>
<ul>
<li>The creation or use of chemical, biological, radiological, or nuclear weapons resulting in mass casualties.</li>
<li>Mass casualties or at least $500 million of damage resulting from cyberattacks on critical infrastructure.</li>
<li>Mass casualties or at least $500 million of damage resulting from an AI model acting with limited human oversight that would constitute a crime if committed by a human.</li>
<li>Other grave harms to public safety and security of “comparable severity” to the above.</li>
</ul>
<p>The bill also explicitly excludes:</p>
<ul>
<li>Harms caused by information that is otherwise reasonably publicly accessible.</li>
<li>Harms caused by a covered model combined with other software if the covered model didn’t materially contribute to the harm.</li>
</ul>
<p>The thresholds for both definitions get reviewed after 1 January 2027, and monetary amounts adjusted annually for inflation.</p>
<p>These two thresholds quite clearly show that this is a bill designed to capture very large, very powerful AI models, and is specifically focused on severe, large scale harms like terrorist events or cyberattacks on critical infrastructure.</p>
<h3 id="requirements-for-developers">Requirements for developers</h3>
<p>The bill outlines several requirements for developers of covered models:</p>
<ul>
<li>Developers must register their covered models with the California Department of Technology.</li>
<li>Comprehensive evaluations must be conducted to identify potential critical harms.</li>
<li>Appropriate measures must be implemented to prevent unauthorised access, misuse, or unsafe modifications of the model.</li>
<li>Any critical harms or near misses must be promptly reported to the authorities.</li>
<li>Developers must maintain the ability to cease operation of the model if necessary to prevent critical harm.</li>
</ul>
<p>As well as creating lots of paperwork for everyone involved, these requirements aim to ensure responsible development and deployment of frontier models, while maintaining transparency and accountability.</p>
<h3 id="what-sb-1047-is-not">What SB-1047 is not</h3>
<p>I suspect most people would agree that ensuring terrorists can’t use AI to learn how to build and use biological weapons is, on balance, a good thing. But, I also think people have more immediate concerns about AI. Things like:</p>
<ul>
<li>Mass unemployment and re-skilling in the face of automation.</li>
<li>The environment and the huge amounts of energy consumed by training AI models.</li>
<li>Copyright issues and the industrial scraping of internet data for training AI models.</li>
<li>Easy access to tools that can create highly realistic deep-fakes and deceiving images.</li>
</ul>
<p>Guess what? SB-1047 has nothing to say about any of that. The bill is as minimal in its scope as you could expect. I mean, if you don’t draw the red line at chemical and nuclear weapons, then where do you?</p>
<h2 id="impacts-on-open-source-ai">Impacts on open source AI</h2>
<p>Whilst SB-1047 does not specifically prohibit open weight models, it creates a regulatory environment that could discourage developers of covered models from releasing them openly. For example:</p>
<blockquote>
<p>“the developer shall […] prevent unauthorized access to, misuse of, or unsafe post-training modifications of the covered model”</p>
</blockquote>
<blockquote>
<p>the developer shall […] implement appropriate safeguards to prevent the covered model and covered <strong>model derivatives</strong> from causing or materially enabling a critical harm</p>
</blockquote>
<blockquote>
<p>a developer shall not […] make a covered model or a covered model derivative available for commercial or public, or foreseeably public, use, if there is an unreasonable risk that the covered model or covered model derivative will cause or materially enable a critical harm</p>
</blockquote>
<p>Taken at face value, these provisions seem incompatible with a fully open release. Developers of closed models can design safeguards into their own systems and infrastructure, but once a model is out in the wild the developers lose all practical control - and they suddenly become liable for the actions of others.</p>
<h3 id="safeguarding-ai-is-hard">Safeguarding AI is hard</h3>
<p>You might be thinking, why can’t the models just be released without the capability of teaching us how to do things like build biological weapons in the first place?</p>
<p>It seems a reasonable question. After all, a model just reflects its training data, so just don’t train it on the bad stuff, right? Unfortunately, it’s not as simple as that.</p>
<ul>
<li>Large language models can develop emergent capabilities they weren’t explicitly trained for. They can combine information in novel and unpredictable ways, leading to unexpected capabilities.</li>
<li>Much information that could be used for harmful purposes, also has legitimate uses. For example, knowledge about chemistry or biology that could be used for important medical breakthroughs, could also be used to create weapons.</li>
<li>The sheer vastness of possible inputs and outputs makes it practically impossible to anticipate all potential harmful uses.</li>
<li>Users will always try (and often succeed) in circumventing trained-in safety measures through clever prompting techniques.</li>
<li>Some potential harms, such as cyberattacks, don’t even require any specific “dangerous” training data to be a threat.</li>
</ul>
<p>There is a whole area of research around “model alignment”. Ensuring AI systems behave in ways that are aligned with human values and intentions is complex and not fully solved.</p>
<p>The bill recognises these complexities and so focuses on risk assessments, monitoring and auditing, and the ability to “shut down” the model. All of which becomes very difficult once that model is released openly and out in the wild.</p>
<h3 id="is-this-regulatory-capture">Is this regulatory capture?</h3>
<p>It’s worth noting that as it stands today, there probably isn’t any open-weight model that would fall within the act’s definition of a “covered model”. Even Llama 3.1 405B (the largest open weight model to date) <a href="https://x.com/astonzhangAZ/status/1815763885380747422">probably falls short</a>.</p>
<p>But, it’s not far off. And there must now be question marks over whether Meta will ever release an even larger model openly.</p>
<p>Time will tell, and there’s certainly enough <em>“unreasonable risk”</em> and <em>“materially contribute”</em> type language in the bill to give Meta’s lawyers something to chew on.</p>
<p>But, if this does discourage future frontier models being released openly, then that’s a scenario that clearly benefits the developers of closed, proprietary models, and all but closes the door to new entrants to the market. That would point towards a future where access to the very best and most powerful AI models is exclusively controlled by a handful of all-powerful tech giants.</p>
<h2 id="winners-and-losers">Winners and losers</h2>
<p>At first glance, SB-1047 might seem like a win for those advocating for stricter AI regulation. However, a closer look reveals a more nuanced picture.</p>
<p>The bill’s light-touch approach and narrow focus means that many in the AI industry, particularly the big AI labs, may be breathing a sigh of relief. Earlier in the week, Anthropic’s CEO Dario Amodei said of the bill, <em>“we believe its benefits likely outweigh its costs”</em> and <a href="https://x.com/elonmusk/status/1828205685386936567">Elon Musk tweeted</a> that <em>“California should probably pass the SB-1047 AI safety bill.”</em> - statements that suggest to me that they were happy with what was on the table.</p>
<p>On the other hand, those hoping for more comprehensive regulation may feel the bill does not go far enough. The focus on preventing catastrophic harms, while important, leaves many day-to-day AI ethics issues unaddressed.</p>
<p>The most significant impact is the bill’s potential to discourage the release of fully open, frontier-level models. While not explicitly prohibited, the bill’s provisions will make it challenging to release covered models openly. Whilst I don’t believe this in any way “kills open source AI”, as it only targets frontier models, this does potentially take us to a future where the most advanced AI capabilities are concentrated in the hands of a few large corporations.</p>
<p>In essence, SB-1047 strikes a delicate balance. And perhaps the legislators wouldn’t be doing their job if both sides of the debate weren’t at least a little bit unhappy.</p>]]></description><link>https://2point0.ai/posts/sb-1047-passes</link><guid isPermaLink="false">02641ef4-1b5e-5cb2-b831-65bb68064b9b</guid><pubDate>Fri, 30 Aug 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[AI's open source summer revolution]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/summer-revolution.D2hYGG2p_6tSsP.webp" srcset="https://2point0.ai/_astro/summer-revolution.D2hYGG2p_6tSsP.webp 1x,https://2point0.ai/_astro/summer-revolution.D2hYGG2p_6tSsP.webp 2x" alt="AI's open source summer revolution" /><figcaption>AI's open source summer revolution</figcaption></figure><p>It’s summertime and the living is easy. Unless, that is, you’re trying to keep up with events in the AI world. If you thought you could take a week’s holiday without missing some paradigm-shifting event, think again.</p>
<p>The last couple of weeks have seen a flurry of announcements and developments, forcing this one relaxed holidaymaker out of vacation mode and back into blogging mode. Let’s dive into it.</p>
<ul>
<li>Meta releases Llama 3.1, including a 405B parameter model rivalling proprietary heavyweights.</li>
<li>Mistral unveils two new models: the powerful Mistral Large 2 and the developer-friendly Mistral Nemo.</li>
<li>Ollama introduces native support for function calling, enhancing local LLM capabilities.</li>
</ul>
<p>For developers, builders and AI enthusiasts, there’s some juicy stuff in here. So let’s take a deeper dive into what’s new.</p>
<h2 id="meta-launches-llama-31-models">Meta launches Llama 3.1 models</h2>
<p>The big news came from Meta, who have <a href="https://ai.meta.com/blog/meta-llama-3-1">released new versions of Llama 3</a>, with increased context size (128k), better multilingual abilities and tool calling.</p>
<p><img src="https://2point0.ai/images/2024/meet-llama-3-1.webp" alt="Meet Llama 3.1"></p>
<p>This includes the release of a new 405B parameter model that offers “frontier level” capabilities - performance that was previously only available in the most advanced proprietary models like GPT-4o and Sonnet 3.5.</p>
<p>When I started this blog - only seven months ago - GPT-4 was a country mile ahead of everyone else. Now, not only have other proprietary models caught up (and overtaken), for the first time <a href="https://x.com/karpathy/status/1815842603377779140">frontier level LLM capabilities are available to everyone</a> to work with and build on.</p>
<p>This really is a huge moment. Mark Zuckerberg writes about <a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward">the philosophy and arguments in favour of open source AI</a>, and there’s a fair amount of strategy at play here too. All eyes now turn to OpenAI to see if they have any GPT-5 shaped tricks up their sleeve, or if their moat truly has evaporated.</p>
<p>Of course, unless you have a small GPU cluster in your garage, you or I won’t be able to run the 405B model at home. Luckily, Llama 3.1 also comes in 8B and 70B parameter sized versions for us to prompt, RAG, fine-tune, optimise and just generally have fun with.</p>
<p>If you want to try the 405B model, the easiest place is probably to hop on to <a href="https://huggingface.co/chat">Hugging Chat</a> (unless you’re an EU citizen, in which case, VPN is your friend). The smaller 8B and 70B models are widely available on popular chat platforms, or can be downloaded from <a href="https://huggingface.co/meta-llama">Hugging Face</a> or <a href="https://ollama.com/library/llama3.1">Ollama</a>.</p>
<h2 id="mistral-launches-two-new-models">Mistral launches two new models</h2>
<p>In a case of exceptionally poor timing, whilst everyone was talking about the Llama 3.1 news, Mistral have also launched two new models that risk being overlooked, but definitely shouldn’t be.</p>
<p><a href="https://mistral.ai/news/mistral-large-2407">Mistral Large 2</a> is a 123B parameter model, with 128k context size, significantly improved capabilities and impressive benchmarks all-round. Following on from Mistral’s work on Codestral, they claim Mistral Large 2 compares favourably to Llama 3.1 405B in Maths and Coding.</p>
<p><img src="https://2point0.ai/images/2024/mistral-large-2-performance.webp" alt="Mistral Large 2 performance"></p>
<p>More exciting to me than the large model is its little sibling <a href="https://mistral.ai/news/mistral-nemo/">Mistral Nemo</a>, a new 12B parameter model with 128k context size, built in collaboration with NVIDIA and released under an entirely permissive licence - and it looks to be an absolute banger for its size category. This is a drop-in replacement for Mistral 7B, and will be sure to keep the fine-tune community busy over the coming months.</p>
<h2 id="ollama-function-calling">Ollama function calling</h2>
<p>One thing Llama 3.1 and the new Mistral models have in common is significantly improved support for tool use and function calling, built into the models’ training. Which is all very convenient, because my favourite tool for running LLMs locally, <a href="https://ollama.com/blog/tool-support">Ollama, has shipped version 0.3</a> with first-class support for tool use / function calling.</p>
<p><img src="https://2point0.ai/images/2024/ollama-tool-calling.webp" alt="Ollama tool calling"></p>
<p>Until now, it’s been <em>kind of</em> possible to implement function calling with Ollama, but the models weren’t really explicitly trained for it and the implementation involved too much double-sided sticky tape for my liking.</p>
<p>This release makes tool use a first-class feature of Ollama, supported in both its own chat API and the OpenAI-compatible endpoint. Combined with new models that are explicitly trained to handle tool use, all of a sudden the prospect of plugging local models into agentic systems is no longer a pipe dream.</p>
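<p>To get a feel for the shape of this, here’s a minimal Python sketch of the tool-use loop. The schema follows the OpenAI-style function format that Ollama’s chat API accepts; the tool itself is a stub, and everything here (names, fields, the simulated tool call) is illustrative rather than lifted from Ollama’s docs:</p>

```python
import json

# OpenAI-style function schema, as passed in a chat request's `tools` field
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def get_current_weather(city: str) -> str:
    # Stubbed implementation - a real tool would call a weather API here
    return json.dumps({"city": city, "temp_c": 21, "conditions": "sunny"})

TOOLS = {"get_current_weather": get_current_weather}

def dispatch(tool_calls: list) -> list:
    """Run each tool call the model requested and build `tool` role
    messages to feed back into the chat for the model's final answer."""
    results = []
    for call in tool_calls:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # some endpoints return JSON strings
            args = json.loads(args)
        output = TOOLS[fn["name"]](**args)
        results.append({"role": "tool", "content": output})
    return results

# Simulated model response, shaped like a tool-calling chat message
fake_call = [{"function": {"name": "get_current_weather",
                           "arguments": {"city": "London"}}}]
print(dispatch(fake_call))
```

<p>In a real session you’d send the tool schemas alongside your messages to Ollama’s chat endpoint, run <code>dispatch</code> over any tool calls in the response, then post the resulting <code>tool</code> messages back so the model can compose its final answer.</p>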
<h2 id="time-to-build">Time to build</h2>
<p>If you’re a developer, and you’re building with large language models, what a couple of weeks! The big picture is that for the first time, state of the art, frontier LLM capabilities are available to everyone to use and build on. But it’s not just a case of the bigger models getting bigger and better, the smaller models are improving too. Models that anyone can run on their local computer are as capable today as the leading proprietary models were a year ago.</p>
<p>With the capabilities of models improving across all size categories, and the tooling and dev infrastructure improving all the time, doors to opportunities open up that just weren’t possible a few months ago.</p>
<p>It’s a great time to be building in this space. I had to jump out of holiday mode to tell you this, but now I need to get coding. There’s stuff to build!</p>]]></description><link>https://2point0.ai/posts/ai-open-source-summer-revolution</link><guid isPermaLink="false">72416bb9-d63a-5756-a745-98aa58134b39</guid><pubDate>Mon, 29 Jul 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Claude 3.5 Sonnet takes the AI crown (for now)]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/claude-the-king.dB32nzCz_Ze8FPe.webp" srcset="https://2point0.ai/_astro/claude-the-king.dB32nzCz_Ze8FPe.webp 1x,https://2point0.ai/_astro/claude-the-king.dB32nzCz_Ze8FPe.webp 2x" alt="Is Claude 3.5 Sonnet the new King?" /><figcaption>Is Claude 3.5 Sonnet the new King?</figcaption></figure><p>Last week, <a href="https://www.anthropic.com/news/claude-3-5-sonnet">Anthropic announced Claude 3.5 Sonnet</a>, the latest update to their Claude family of models. As is usual with each new state-of-the-art LLM release, the new Sonnet raises the bar across a range of industry evaluations. And, in an unusual shocker for AI announcements, they even released the model for us to play with on the same day!</p>
<p>The AI community on Twitter / X have very quickly reached seeming consensus that the new Sonnet is indeed very good. Perhaps even, a GPT-4 killer?</p>
<p>So, how good is it? Has GPT-4 been dethroned? Let’s take a look.</p>
<h2 id="understanding-claude-35-sonnet">Understanding Claude 3.5 Sonnet</h2>
<p>It doesn’t seem that long ago that Anthropic were announcing Claude 3. In fact, it was literally only three months ago, but this is AI and space-time works a little differently here.</p>
<p>Claude 3 was delivered in three variants, positioned to give users a choice with clear trade-offs. Haiku, the smallest model, prioritised speed and cost. Sonnet, the medium model, was the all-rounder best suited to most tasks. And Opus, the premium model, was more intelligent and capable, but slower and more expensive.</p>
<p>Claude 3.5 Sonnet is the first of the three variants to be upgraded. Anthropic’s own benchmarks show it out-performing Claude 3 Opus, their previously best model, on all measures. It also outcompetes OpenAI’s GPT-4o model on all but two benchmarks.</p>
<p><img src="https://2point0.ai/images/2024/sonnet-evals.webp" alt="Claude 3.5 Sonnet benchmarks"></p>
<p>The fact that the new Sonnet now outperforms the old Opus is interesting as that negates the role of the slower and more expensive Opus. An even more impressive Opus must surely be around the corner, which is a tantalising prospect.</p>
<p>Notably, Sonnet can be used free of charge (with usage limits) through <a href="https://claude.ai">claude.ai</a> - Anthropic’s own consumer facing chat UI. There’s an interesting battle playing out here. OpenAI recently made GPT-4o their standard, free-to-use model, which is also a very good, free model. But Claude 3.5 Sonnet seems to have the edge, and makes <a href="https://claude.ai">claude.ai</a> a very compelling choice for new users wanting to experiment with adopting LLMs as part of their daily lives.</p>
<h2 id="artifacts">Artifacts</h2>
<p>In addition to the new model, <a href="https://claude.ai">claude.ai</a> receives a new feature called “Artifacts”. Any documents that Claude generates, such as text documents or code files, are displayed in a dedicated column next to the chat, and code is previewed in a sandbox-like environment.</p>
<p>It’s conceptually simple but a great UX. It creates a more seamless and natural flow when iterating on code files step by step. If you want your imagination piqued, <a href="https://x.com/minchoi/status/1805261552804602151">take a look at these examples</a> of folks putting the new Sonnet and Artifacts through its paces.</p>
<p><a href="https://x.com/dr_cintas/status/1804884790870462923"><img src="https://2point0.ai/images/2024/sonnet-tweet-1.webp" alt="Sonnet memory card game"></a>
<a href="https://x.com/alexalbert__/status/1804258091808231462"><img src="https://2point0.ai/images/2024/sonnet-tweet-2.webp" alt="SpaceX Starship lander game"></a></p>
<p>There are some response token size limits that larger artifacts may bump into, but overall Artifacts is an important step for <a href="https://claude.ai">claude.ai</a> - it marks its evolution from a conversational chatbot to a collaborative work environment - a tool to help you write documents and code, all wrapped in a streamlined UX.</p>
<p>With these improvements, Claude is positioning itself as a formidable competitor to ChatGPT. In fact, could “competitor” be underplaying it?</p>
<h2 id="is-the-king-dead">Is the King dead?</h2>
<p>Claude’s update follows hot on the heels of the release of <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o, OpenAI’s latest flagship model</a>. GPT-4o is faster and cheaper than GPT-4 Turbo, with much improved multi-modal vision and audio capabilities, and replaces the now ancient (in AI terms) GPT-3.5 as ChatGPT’s default free-to-use model.</p>
<p>Whilst GPT-4o undoubtedly raises the bar in terms of its multi-modal capabilities (the <a href="https://openai.com/index/hello-gpt-4o/">announcement post</a> features some great demo videos), for various - erm <a href="https://www.theverge.com/2024/5/20/24161253/scarlett-johansson-openai-altman-legal-action"><em>“reasons”</em></a> - many of the voice and multi-modal features are not yet in the public’s hands. In my admittedly subjective evaluations of GPT-4o, I find it falls short of GPT-4 Turbo for the kind of tasks that I typically use it for.</p>
<p>This leaves me confused about the GPT family of models right now. Everyone was expecting GPT-5, but we got GPT-4o. It’s billed as their flagship model, but is clearly inferior to GPT-4 Turbo for some types of tasks. OpenAI has hinted that GPT-4o is the first in a family of new models. But if so, where is the rest of the family? And when will we be able to talk with that flirty Scarlett Johansson sound-alike?</p>
<p>For now, I think it’s fair to say the crown has fallen.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Instead, the crown currently and deservedly belongs to Anthropic. I like the options their line of models give me, and I’m excited for new Haiku and Opus models. I love that their free-to-use model is currently the best model on the market. And I love Artifacts and the direction <a href="https://claude.ai">claude.ai</a> is heading. The boys and girls working at Anthropic are on a roll, and I’m a fan.</p>
<p>But only a fool would be writing OpenAI off. The question is not whether they will be able to raise the bar again - of course they will - but by how much. Will it be to such an extent that they cement their position as market leaders for the next 12–18 months, as happened with GPT-4, or are we set for a closer-run thing, with smaller and more incremental upgrades and the gap between the top LLM models narrowing?</p>
<p>There is one clear winner in all of this - us! The rising tide lifts everyone, and all users get to benefit from the increased choice and flexibility this competition brings.</p>
<p>But what excites me the most is that Sonnet is not some new bigger and more expensive model - it’s Claude’s medium-sized model, and it remains the same cost. With improvements coming across the family of models, we will find that tomorrow’s smaller models are as capable as today’s larger models. If that trend continues, prices (and energy demands) will come down, opening access to advanced AI to many more potential users.</p>
<p>With Microsoft and Google racing to out-AI each other, Apple have appeared to be behind the curve. Apple are rarely first to a category, but when they enter, it’s always time to pay attention. Their approach could shape how a billion consumers worldwide perceive and interact with artificial intelligence in their daily lives.</p>
<p><lite-youtube videoid="RXeOiIDNNek" videotitle="WWDC 2024 — June 10 | Apple"></lite-youtube></p>
<p>In this article, we’ll run through the key announcements from WWDC 2024, analyse Apple’s AI strategy, and explore the implications of their efforts to make AI more accessible and user-friendly.</p>
<h2 id="is-apple-behind-the-curve">Is Apple behind the curve?</h2>
<p>Apple is perceived to be behind the curve when it comes to AI. Certainly, they are late to embrace LLMs. But, that isn’t the full picture.</p>
<p>The first Apple device to feature a ‘neural engine’ - a dedicated hardware component specifically designed to handle machine learning tasks efficiently - arrived <a href="https://www.wired.com/story/apples-neural-engine-infuses-the-iphone-with-ai-smarts/">back in 2017</a> with the iPhone X and the A11 Bionic chip. This predates ChatGPT’s launch, and the consumer buzz and hype surrounding AI, by a good five years.</p>
<p>More recently, Apple has switched their entire Mac line-up over to Apple Silicon. All M series processors feature a powerful neural engine, multiple GPUs and unified memory, resulting in a surprisingly capable processor for ML and AI tasks.</p>
<p>Apple are only late to the party if you view this moment as their entrance. But, they’ve been stocking the bar and getting the playlist ready for some time!</p>
<h2 id="apple-intelligence">Apple Intelligence</h2>
<p>Contrasting with the Google IO event last month, where Google proudly used AI to count that they had <a href="https://techcrunch.com/2024/05/14/google-mentioned-ai-120-times-during-its-i-o-keynote/">mentioned “AI” over 120 times</a> during their keynote, I think I heard the term only once during the Apple keynote. Instead, Apple gave us a clever rebranding of AI as “Apple Intelligence”.</p>
<p><img src="https://2point0.ai/images/2024/apple-intelligence.webp" alt="Apple Intelligence"></p>
<p>This is a classic Apple move. It allows them to claim ownership of a category from which, until now, they have been conspicuously absent.</p>
<p>Arguably, it’s also a canny move. Whilst AI is making waves, with that comes some murky waters. By reframing the category, Apple can attempt to keep their trunks clean. Apple will be content to let Google rush head first into the AI deep end as long as it’s scoring those <a href="https://www.theverge.com/2024/5/23/24162896/google-ai-overview-hallucinations-glue-in-pizza">glue-on-pizza own goals</a>, whilst Apple can dip its toes in and claim to be doing things differently.</p>
<h2 id="new-features-of-apple-intelligence">New features of Apple Intelligence</h2>
<p>Apple’s neural engine has been enabling subtle quality of life ML (Machine Learning) features such as predictive text or the various image recognition features in the Photos app for a while. Apple has never explicitly labelled these kinds of features as AI, but that is exactly what they are.</p>
<p>With Apple Intelligence, and the new OSes it will arrive with, Apple is significantly ramping up these kinds of ML-powered quality of life features all around the system - features that are designed to surface where you need them and when you need them.</p>
<p>Apple Intelligence brings plenty of new features - more than I can list - but here are a few that stand out:</p>
<ul>
<li>Writing tools - summarisation, proofreading, rewording, etc</li>
<li>Image playground - generative image tools</li>
<li>Genmoji - for those times when none of the standard 3,500 emoji icons perfectly express the required emotion</li>
<li>Priority notifications, priority messages in Mail, email summaries and smart replies</li>
<li>Photos - clean up tool, generate smart memories and collections, much improved semantic search</li>
<li>Math Notes - genuinely impressive dynamically calculated results for mathematical formulae and equations</li>
<li>Smart script - use AI to smooth out your messy handwriting</li>
</ul>
<p><img src="https://2point0.ai/images/2024/apple-math-notes.webp" alt="Apple Math Notes"></p>
<h2 id="siris-ai-powered-upgrade">Siri’s AI-powered upgrade</h2>
<p>Siri, Apple’s much maligned digital assistant, is set for a long overdue and much needed kick up the virtual backside with some new LLM-powered abilities. Expect more natural exchanges of conversation, and deeper, more impressive abilities to instruct Siri to perform actions across different apps and systems. Siri is contextually aware of all our data - our contacts, calendar, photos and files - so there is a lot of potential for Siri to finally live up to the promise that it never quite delivered on.</p>
<p>And for those times when Siri isn’t able to help with your request, it has a buddy as backup. Where appropriate, Siri will ask you if you’d like your request to be handled by ChatGPT. You always explicitly <strong>opt in</strong> to this, and if you do, GPT-4o will respond instead of Siri. This is free and available to all users, without requiring an OpenAI account.</p>
<h2 id="whats-the-catch">What’s the catch?</h2>
<p><img src="https://2point0.ai/images/2024/apple-intelligence-compatibility.webp" alt="Apple Intelligence Compatibility"></p>
<p>The big caveat that all of these Apple Intelligence features carry is that you will need a relatively new device to take advantage of them. When Apple Intelligence is released in the Autumn, it will only be available on the iPhone 15 Pro and above, or any Mac or iPad with an M-series processor.</p>
<h2 id="the-privacy-tightrope">The privacy tightrope</h2>
<p>Apple’s reputation for user privacy is good and well deserved. They have stood out amongst the tech giants in defending users’ privacy and advocating for storing and processing data on device rather than in the cloud (and they take every opportunity to remind us when Google and Microsoft do the opposite).</p>
<p>For this reason, Apple finds itself in a tight spot. While small local ML models can impress for certain use cases, there will always remain a category of AI tasks that necessitate very large models hosted in data centres. Thus, if Apple is to compete, it must offload certain tasks to the cloud, and potentially undermine its good privacy reputation.</p>
<p>During the keynote, Apple emphasised privacy a lot. This is the approach they are taking:</p>
<ul>
<li>All user data is stored and processed locally, on device.</li>
<li>The majority of AI workloads will also be handled on device using a <a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models">3B parameter LLM</a>.</li>
<li>Where necessary, Apple will offload more demanding AI workloads to what it calls “Private Cloud Compute” - Apple owned and controlled infrastructure where requests are encrypted and data never stored. Security researcher Matthew Green <a href="https://x.com/matthew_d_green/status/1800291897245835616">posted some good thoughts on this</a>.</li>
<li>Finally, for specific types of requests, Apple will let users opt in to forwarding them to OpenAI and GPT-4o. The agreement with OpenAI stipulates that they cannot identify you through your requests or store your data - if that holds any weight with you. But the crucial thing is that you ultimately have the choice, on a per-request basis, whether or not the request goes out to OpenAI.</li>
</ul>
<p>It’s reassuring that Apple emphasises privacy, but this is tricky ground for any tech company to navigate. Microsoft recently dug itself into a hole with its <a href="https://thehackernews.com/2024/06/microsoft-revamps-controversial-ai.html">AI-powered Recall feature</a>, demonstrating that sceptical consumers have <a href="https://www.theguardian.com/world/interactive/2013/nov/01/snowden-nsa-files-surveillance-revelations-decoded#section/1">not forgotten Snowden</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This feels like a solid, yet safe entrance to the AI battleground by Apple. Except for Math Notes (which was pleasingly delightful) and the promise of a more powerful and capable Siri, most of the quality of life AI features Apple is introducing fall into the “seen it before” category.</p>
<p>But what we <strong>can</strong> expect, in the way that Apple is uniquely adept at doing, is the tightest and neatest integrations, so these intelligence features surface exactly where and when a user needs them, in a way that, if you’ll excuse the cliché, <em>“just works”</em>.</p>
<p>Notably missing were any multimodal vision and voice features, like Google demoed with <a href="https://www.youtube.com/watch?v=nXVvvRhiGjI">Project Astra</a>, and OpenAI is rolling out with <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>. I suspect Apple are happy to err on the side of caution and deliver simple, polished AI features, and let Google be the one telling its users to <a href="https://www.bbc.co.uk/news/articles/cd11gzejgz4o">eat rocks</a>!</p>
<p>Perhaps the most significant aspect of Apple Intelligence is its potential to bring AI capabilities to over a billion users around the world. Not just the tech nerds and AI enthusiasts, but everyday consumers. From grandparents to students, from professionals to hobbyists, Apple’s user base is as diverse as it is massive.</p>
<p>As Apple has done with past revolutionary technologies like the graphical user interface and multitouch screens, they now have the opportunity to make AI a ubiquitous part of our daily lives. If executed well, Apple Intelligence opens the doors to AI for billions of people, leading to truly global public understanding and acceptance of this technology.</p>]]></description><link>https://2point0.ai/posts/unpacking-apples-ai-strategy-wwdc-2024</link><guid isPermaLink="false">7641a39c-8c43-5f8d-83db-11a517daf6eb</guid><pubDate>Wed, 12 Jun 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Setup a local, private AI coding assistant with Continue and Ollama]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/continue-ollama.BDMp3U3u_Z1I6ty7.webp" srcset="https://2point0.ai/_astro/continue-ollama.BDMp3U3u_Z1I6ty7.webp 1x,https://2point0.ai/_astro/continue-ollama.BDMp3U3u_Z1I6ty7.webp 2x" alt="Setting up Continue with Ollama" /><figcaption>Setting up Continue with Ollama</figcaption></figure><p>In our previous post, I walked you through the process of <a href="https://2point0.ai/posts/continue-groq-llama3-superpowers">setting up Continue.dev with Groq and Llama 3</a>, resulting in a best-in-class, fast and powerful (GPT-4 comparable) AI coding assistant in your VS Code environment.</p>
<p>But what if you’re looking for a solution that prioritises privacy, allows for offline use, and keeps everything running locally on your own machine? Enter our old friend Ollama.</p>
<p>Ollama is an open source toolset for running large language models, and makes the process of downloading and running open-weight models pleasingly simple. And of course, it integrates very easily with Continue, providing you with all the same powerful features without relying on external APIs and LLM providers.</p>
<p>What’s more, Ollama actually unlocks a couple of Continue features that the Groq and Llama 3 combo didn’t - namely tab autocompletions and custom embedding models (which power Continue’s <a href="https://2point0.ai/posts/what-is-rag">RAG-like</a> code search features).</p>
<p>We’re going to dive into all of that and more in this article. So, whether you’re concerned about privacy, need to work offline, or simply prefer the flexibility and control of a local setup, this guide will walk you through all the steps to get up and running with Continue and Ollama. Let’s get started!</p>
<h2 id="setting-up-ollama">Setting up Ollama</h2>
<p>If you haven’t already done so, head over to <a href="https://ollama.com">ollama.com</a>, click the big download button and follow the instructions. If you need to, read my post on <a href="https://2point0.ai/posts/running-llms-with-ollama">running models locally with Ollama</a>, which walks you through the installation and basic mechanics of Ollama.</p>
<p>To use Ollama with Continue, we’re going to need three different models to experiment with different features of Continue.</p>
<ul>
<li>For the main coding assistant chat features, we’ll use Llama 3, just like we did in the last post. But whilst Groq gives us access to the full 70b parameter version of Llama 3, unless you’ve got some beefy GPUs at your disposal, we’re going to have to settle for the smaller 8b parameter version.</li>
<li>To benefit from tab autocompletion, we need a model that is specifically trained to handle what are known as fill-in-the-middle (FIM) predictions. Unlike normal text predictions, which append tokens to the end of a sequence, with FIM the model already has the beginning and the end and predicts what goes in the middle. Llama 3 does not support FIM, so instead we’ll use StarCoder2 3b, a tiny model that should run pretty fast, and is trained specifically for coding tasks across different programming languages.</li>
<li>We’ll also download an embeddings model. We’ll configure Continue to use <code>nomic-embed-text</code> via Ollama to generate embeddings, and we’ll have a play with Continue’s code retrieval features.</li>
</ul>
<p>To install the models we need, open up your terminal and paste in the following lines:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="shell"><code><span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> pull</span><span style="color:#A3BE8C"> llama3:8b</span><span style="color:#616E88">           # 4.7 GB download</span></span>
<span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> pull</span><span style="color:#A3BE8C"> starcoder2:3b</span><span style="color:#616E88">       # 1.7 GB download</span></span>
<span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> pull</span><span style="color:#A3BE8C"> nomic-embed-text</span><span style="color:#616E88">    # 274 MB download</span></span></code></pre>
<p>In total, this is going to download about 6.7 GB of model weights. A great opportunity to step away from the computer and make a cup of tea ☕.</p>
<h2 id="integrating-ollama-with-continue">Integrating Ollama with Continue</h2>
<p>Once the models are downloaded, hop into VS Code and edit Continue’s <code>config.json</code> (hit <code>⇧+⌘+P</code> and type <code>continue config</code> to quickly find it).</p>
<p>All we need to do is make the following three changes:</p>
<ul>
<li><code>"models"</code> - add a definition for our local Llama 3 model so it is available to use in Continue.</li>
<li><code>"tabAutocompleteModel"</code> - set the tab autocomplete model to Starcoder 2.</li>
<li><code>"embeddingsProvider"</code> - set the embedding provider to Nomic Embed Text.</li>
</ul>
<p>See below:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="json"><code><span class="line"><span style="color:#ECEFF4">{</span></span>
<span class="line"><span style="color:#ECEFF4">  "</span><span style="color:#8FBCBB">models</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [</span></span>
<span class="line"><span style="color:#ECEFF4">    {</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">title</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">Local Llama 3 8b</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">provider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">ollama</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">model</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">llama3:8b</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">completionOptions</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span></span>
<span class="line"><span style="color:#ECEFF4">        "</span><span style="color:#8FBCBB">stop</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">&#x3C;|eot_id|></span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">]</span></span>
<span class="line"><span style="color:#ECEFF4">      },</span></span>
<span class="line"><span style="color:#D8DEE9">    ...</span></span>
<span class="line"><span style="color:#ECEFF4">    }</span></span>
<span class="line"><span style="color:#ECEFF4">  ],</span></span>
<span class="line"><span style="color:#ECEFF4">  "</span><span style="color:#8FBCBB">tabAutocompleteModel</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">title</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">Starcoder 2 3b</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">provider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">ollama</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">model</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">starcoder2:3b</span><span style="color:#ECEFF4">"</span></span>
<span class="line"><span style="color:#ECEFF4">  },</span></span>
<span class="line"><span style="color:#ECEFF4">  "</span><span style="color:#8FBCBB">embeddingsProvider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">title</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">Nomic Embed Text</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">provider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">ollama</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">model</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">nomic-embed-text</span><span style="color:#ECEFF4">"</span></span>
<span class="line"><span style="color:#ECEFF4">  },</span></span>
<span class="line"><span style="color:#D8DEE9">  ...</span></span>
<span class="line"><span style="color:#ECEFF4">}</span></span></code></pre>
<p>That’s all there is to it. Continue is now ready and waiting to assist your coding, locally powered by Ollama.</p>
<h2 id="using-continue-with-ollama">Using Continue with Ollama</h2>
<p>Now that everything is ready, we can try this out. Highlight some code, tap <code>⌘+L</code> to bring up the side panel, and make sure the “Local Llama 3 8b” model is selected in the drop box at the bottom of the chat panel. You can chat with Llama 3, ask questions about your code, and use all the slash commands we covered in the previous post. Similarly, highlight some code, tap <code>⌘+I</code> and instruct Llama to make changes to your code.</p>
<p>We covered all of this last time round, so if you need to get up to speed with the fundamentals of how to use Continue, read the post on <a href="https://2point0.ai/posts/continue-groq-llama3-superpowers">using Continue with Groq and Llama 3</a>.</p>
<p><img src="https://2point0.ai/images/2024/continue-chat.webp" alt="Continue chat side panel"></p>
<p>Using Ollama with Continue gives us some new toys to play with, so let’s try them out.</p>
<h3 id="tab-autocompletion">Tab autocompletion</h3>
<p>Tab autocomplete works like this: you are busy typing away writing some code; you momentarily pause typing to scratch your head; Continue takes the snippet of code you are working on, passes it to an LLM that supports fill-in-the-middle (FIM) completions, and in the time it takes you to finish scratching and move your fingers back to the keyboard, it has had a go at completing the line or chunk of code that you’re working on; you can either ignore it, or if it’s a good prediction hit <code>TAB ⇥</code> to accept the changes.</p>
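<p>Under the hood, a FIM request is just an ordinary completion with sentinel tokens marking where the gap is. Here is a minimal sketch of how such a prompt might be assembled - the token names follow the StarCoder family’s convention, but treat the exact format as an assumption, since Continue handles all of this for you:</p>

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the sentinel tokens tell the
    model to generate the code that belongs between prefix and suffix."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The cursor sits between these two fragments of a half-written function:
prefix = "def greet(name):\n    message = "
suffix = "\n    return message\n"

print(build_fim_prompt(prefix, suffix))
```

<p>The model’s completion is then spliced into the gap at your cursor, which is why FIM-trained models feel so much more natural for autocomplete than plain left-to-right completion models.</p>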
<p><video autoplay muted loop src="https://2point0.ai/video/2024/continue-autocomplete.mp4"></video></p>
<p>The promise is that Continue becomes this Jedi-like coding assistant that knows what you’re coding before you’ve even coded it. In practice, though, your mileage may vary. I often switch AI tab autocomplete off - I find the UX slightly awkward as it works in tandem with native VS Code tab completions (which I use a lot), and I feel like the AI completions get in my way more than they help me.</p>
<p>The Continue docs state they <em>“will be greatly improving the experience over the next few releases”</em>, suggesting the devs know it’s a bit half-baked currently. But it’s definitely worth trying out yourself - hopefully it will improve over time with future releases.</p>
<h3 id="embeddings-and-code-search">Embeddings and code search</h3>
<p>I wrote a little bit about embeddings in <a href="https://2point0.ai/posts/what-is-rag">my RAG explainer</a>. In short, an embedding is <a href="https://spiess.dev/note/engineering/ml/embeddings">a vector of numerical values that captures the semantic meaning of a chunk of text</a> (or in this case, a chunk of code). Embeddings are an important part of how large language models work, but for our understanding here it’s just important to know that we can use embeddings to compare how semantically close different chunks of text/code are.</p>
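<p>To make the idea concrete, here is a toy sketch of comparing embeddings with cosine similarity. The vectors are made up for illustration - real embeddings have hundreds of dimensions:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Imaginary 3-dimensional "embeddings" of three chunks of code:
parse_json = [0.9, 0.1, 0.2]   # a JSON-parsing function
load_config = [0.8, 0.2, 0.3]  # semantically close: also reads structured data
draw_chart = [0.1, 0.9, 0.7]   # unrelated rendering code

print(cosine_similarity(parse_json, load_config))  # high - close to 1
print(cosine_similarity(parse_json, draw_chart))   # noticeably lower
```

<p>Chunks that do similar things end up with vectors pointing in similar directions, so a nearest-neighbour search over embeddings becomes a semantic search over your code.</p>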
<p>Continue generates embeddings out of the box using Transformers.js - a decent “it just works” solution that generates embeddings of size 384. <code>nomic-embed-text</code> via Ollama will generate embeddings of size 768, which in theory should improve the accuracy of Continue’s code retrieval features.</p>
<p>The <a href="https://docs.continue.dev/walkthroughs/codebase-embeddings">Continue docs give some examples</a> where you might use code retrieval. It’s achieved using the <code>@codebase</code> or <code>@folder</code> context providers and typing a prompt. For example:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="text"><code><span class="line"><span>@codebase how do I implement the `LLMAdapter` abstract class?</span></span></code></pre>
<p>The above generates embeddings from your codebase, searches for a <code>LLMAdapter</code> abstract class, and attempts to explain how to implement a concrete version of the class.</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="text"><code><span class="line"><span>@src what events does the `Job` class emit</span></span></code></pre>
<p>This example generates embeddings from the <code>/src</code> folder, searches for a <code>Job</code> class and explains what events it emits.</p>
<p>This is powerful stuff - it turns your AI coding assistant from a coding expert with general knowledge, to a coding expert with specific knowledge about what you’re working on.</p>
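<p>Conceptually, the retrieval step behind <code>@codebase</code> is simple: embed each chunk of code once, embed the query, and return the chunks whose embeddings are most similar. Here is a highly simplified sketch, with a hypothetical bag-of-words <code>embed</code> standing in for a real embedding model such as <code>nomic-embed-text</code>:</p>

```python
import math
import re

def embed(text):
    """Stand-in for a real embedding model: a crude bag-of-words vector.
    A real setup would request embeddings from a model via Ollama instead."""
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Rank code chunks by similarity to the query, most relevant first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "class Job: emits 'started' and 'finished' events",
    "def render_chart(data): draw a bar chart",
    "class LLMAdapter(ABC): abstract base for model backends",
]
print(top_k("what events does the Job class emit", chunks, k=1))
```

<p>The retrieved chunks are then placed into the LLM’s context alongside your question - the essence of the RAG pattern.</p>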
<h2 id="conclusion">Conclusion</h2>
<p>By using Continue with Ollama, not only do you gain a fully local AI coding assistant, but you do so with complete control over your data and can ensure the privacy of your codebase. With Ollama’s support for tab autocompletion and custom embedding models, you can enhance Continue’s capabilities even further, making it an essential tool in your coding workflow.</p>
<p>But the real beauty of Continue lies in its flexibility. You don’t have to choose between the speed and power of Groq, or the privacy and offline capabilities of Ollama. Continue allows you to seamlessly switch between different models and setups, adapting to your specific needs and preferences. You can leverage the strengths of each approach, using Groq and Llama for lightning-fast coding assistance during your day-to-day work, and switching to Ollama when privacy is a top priority or when you need to work offline.</p>
<p>So, whether you’re a developer concerned about the confidentiality of your code, someone who frequently works in offline environments, or simply curious about the possibilities of local AI coding assistants, I encourage you to dive in and experiment with Continue and Ollama. The setup process is straightforward, and the potential benefits are immense. Have fun, and happy coding!</p>]]></description><link>https://2point0.ai/posts/local-coding-assistant-continue-ollama</link><guid isPermaLink="false">4f05ee8b-149f-59b6-935e-2ed3bcf424d6</guid><pubDate>Mon, 20 May 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[How using Continue, Groq and Llama 3 gives you coding superpowers]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/coding-superpowers.DabuKKty_23G2O.webp" srcset="https://2point0.ai/_astro/coding-superpowers.DabuKKty_23G2O.webp 1x,https://2point0.ai/_astro/coding-superpowers.DabuKKty_23G2O.webp 2x" alt="Coding superpowers with Continue, Groq and Llama 3" /><figcaption>Coding superpowers with Continue, Groq and Llama 3</figcaption></figure><p>When was the last time you searched Stack Overflow for a technical question? Yeah… thought so.</p>
<p>Of all the areas and industries AI is set to disrupt, software development and coding are where AI tools are the most polished and genuinely useful today. At our fingertips we have access to tools that understand the context of our code, can suggest improvements, write tests, explain concepts and just generally enhance our capabilities.</p>
<p>Amidst all the hype and commercial offerings, there’s a completely free setup that provides AI coding assistance of remarkable quality and blistering speed. I’m talking GPT-4 comparable language models, running on specialised hardware designed for ludicrous AI inference performance. For free.</p>
<p>Even if you’ve already got AI baked into your dev workflow, free is hard to argue with. So strap in and let me show you how to get world-class AI coding assistance in your IDE today.</p>
<h2 id="the-tools-well-be-using">The tools we’ll be using</h2>
<p>Within our IDE, we’ll install the <a href="https://www.continue.dev/">Continue extension</a> - an open source AI coding assistant that adds a range of AI-powered enhancements to your development environment. Continue can be configured to run with a wide selection of different LLM providers including both commercial and local models.</p>
<p>We’ll be using Continue with <a href="https://groq.com/">Groq</a>. Groq is a hardware company that has developed specialised chips called Language Processing Units (LPUs) designed specifically to run large language models extremely efficiently. Their LPUs overcome the compute and memory limitations of traditional hardware like GPUs, resulting in much faster inference speeds.</p>
<p>We can take advantage of Groq’s LPUs via their API. Because all of this technology is so new, access to their API is currently free of charge whilst they build out and test the technology. Usual caveats apply: don’t expect it to be free forever; expect some throttling and limitations. In my experience it has been working flawlessly, so taking advantage of it for free feels like a no-brainer.</p>
<p>Groq supports a limited set of open models, but they do support the very capable <a href="https://llama.meta.com/llama3/">Llama 3 from Meta</a>, one of the latest open-source large language models optimised for coding tasks and on par with models like GPT-4 in this domain.</p>
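<p>A useful detail: Groq exposes an OpenAI-compatible API, which is exactly why it can slot into tools built for OpenAI’s endpoints. As a rough sketch using only the Python standard library (it assumes your key is in a <code>GROQ_API_KEY</code> environment variable), a chat completion request is built like this:</p>

```python
import json
import os
import urllib.request

def build_groq_request(prompt: str, model: str = "llama3-70b-8192"):
    """Build a chat-completion request against Groq's OpenAI-compatible API."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.groq.com/openai/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
        },
    )

req = build_groq_request("Write a haiku about fast inference.")
# With a valid key, send it with urllib.request.urlopen(req); the reply
# comes back in the same JSON shape as the OpenAI chat completions API.
print(req.full_url)
```

<p>This OpenAI compatibility is why, in the Continue config below, the provider is set to <code>openai</code> with a custom <code>apiBase</code> pointing at Groq.</p>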
<h2 id="setting-up-groq-and-continue">Setting up Groq and Continue</h2>
<h3 id="1-install-continue">1. Install Continue</h3>
<p>Continue is available as a free <a href="https://marketplace.visualstudio.com/items?itemName=Continue.continue">VS Code extension</a>. To install it, open up the extensions panel (<code>⇧+⌘+X</code>), search for “Continue” and click on install.</p>
<p><img src="https://2point0.ai/images/2024/continue-extension.webp" alt="Continue extension"></p>
<h3 id="2-sign-up-for-groq">2. Sign up for Groq</h3>
<p>To use Groq’s APIs you’ll need to head over to <a href="https://console.groq.com">Groq Cloud</a>, and once you’ve signed up you’ll be able to create an API key.</p>
<p><img src="https://2point0.ai/images/2024/groq-cloud.webp" alt="Groq Cloud API keys"></p>
<h3 id="3-configure-continue">3. Configure Continue</h3>
<p>So far, so easy. Once Continue is installed you’ll see a new button in your activity bar on the far left. It’s recommended to click and drag that button over to the right of your workbench to create a secondary sidebar - this way you’ll be able to use Continue alongside whatever you’ve got in your primary sidebar.</p>
<p>The keyboard shortcut <code>⌥+⌘+L</code> toggles the Continue sidebar to and from view. At the bottom of the Continue sidebar, you should see a little gear icon which you can click to edit the extension’s <code>config.json</code> file.</p>
<p>Within the <code>"models"</code> array, you’ll find a number of preconfigured models. You can keep those or remove them, up to you, but add in the following two entries, entering your Groq API key where shown:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="json"><code><span class="line"><span style="color:#ECEFF4">{</span></span>
<span class="line"><span style="color:#ECEFF4">  "</span><span style="color:#8FBCBB">models</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [{</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">title</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">Groq Llama 8b</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">provider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">openai</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">model</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">llama3-8b-8192</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">apiBase</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">https://api.groq.com/openai/v1</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">apiKey</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">&#x3C;GROQ_API_KEY></span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">completionOptions</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">stop</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">&#x3C;|eot_id|></span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">]</span></span>
<span class="line"><span style="color:#ECEFF4">    }</span></span>
<span class="line"><span style="color:#ECEFF4">  },</span></span>
<span class="line"><span style="color:#ECEFF4">  {</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">title</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">Groq Llama 70b</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">provider</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">openai</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">model</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">llama3-70b-8192</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">apiBase</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">https://api.groq.com/openai/v1</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">apiKey</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">&#x3C;GROQ_API_KEY></span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#8FBCBB">completionOptions</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span></span>
<span class="line"><span style="color:#ECEFF4">      "</span><span style="color:#8FBCBB">stop</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">&#x3C;|eot_id|></span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">]</span></span>
<span class="line"><span style="color:#ECEFF4">    }</span></span>
<span class="line"><span style="color:#ECEFF4">  }]</span></span>
<span class="line"><span style="color:#ECEFF4">}</span></span></code></pre>
<p>And that’s it - you’re now set up and ready to use Continue. So, let’s have a look at what you can do with it.</p>
<h2 id="how-to-use-continue">How to use Continue</h2>
<p>Now that Continue is installed, configured and hooked up with Groq, let’s explore how you can use it in your development workflow.</p>
<h3 id="highlight-and-chat">Highlight and chat</h3>
<p>With the Continue side panel open, make sure you’ve selected Llama 3 70b from the dropdown at the bottom of the panel, and you can start chatting with the model just like you would in ChatGPT.</p>
<p>But if you first highlight some code and then press <code>⌘+L</code>, that code snippet will be added into the chat as context, providing a convenient shortcut to start asking about parts of your code.</p>
<p><img src="https://2point0.ai/images/2024/continue-chat.webp" alt="Continue chat side panel"></p>
<h3 id="context-providers">Context providers</h3>
<p>A context provider in Continue is a plugin that provides context from your workspace (or an external source) to the LLM. All you have to do is reference the context provider by typing <code>@ + name</code>, and under the hood Continue will find that extra context to provide to the LLM.</p>
<p>For example, the context providers <code>@codebase</code> and <code>@directory</code> work similarly by doing some <a href="https://2point0.ai/posts/what-is-rag">RAG magic</a> under the hood, creating embeddings from your source code, retrieving relevant bits of code, and providing those relevant chunks to the LLM as context to your conversation.</p>
<p>Alternatively, <code>@Git Diff</code> provides details of changes to the current working branch as context. This is great for reminding yourself what you’ve been working on for the past 8 hours.</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="text"><code><span class="line"><span>@Git Diff Please summarise what I've actually done in these 66 changed files 🙏🏻</span></span></code></pre>
<p>Context providers are pretty powerful and can plug in to your GitHub issues, Jira tickets or any local or remote data source. And as they are plugins, you can create your own. For more details, read up on <a href="https://docs.continue.dev/customization/context-providers#file-tree">Context Providers in the Continue docs</a>.</p>
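<p>Context providers are switched on in the same <code>config.json</code> used earlier. As a rough sketch based on the Continue docs (provider names and options may vary between versions), enabling a few of the built-in providers looks something like this:</p>

```json
{
  "contextProviders": [
    { "name": "diff" },
    { "name": "codebase" },
    { "name": "folder" }
  ]
}
```

<p>Once listed here, each provider becomes available via its <code>@</code> mention in the chat panel.</p>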
<h3 id="slash-commands">Slash commands</h3>
<p>Slash commands are powerful shortcuts that are activated by typing <code>/ + command</code>. There are a few built-in commands. For example, you can select a piece of code, hit <code>⌘+L</code> to bring it up in the chat panel, then type: <code>/edit refactor using a more performant sort algorithm</code>. Then, at blazing speed thanks to Groq, Continue will bring up a diff editor with refactored changes for you to either accept or reject.</p>
<p>I commonly find myself using the <code>/commit</code> command to generate useful, descriptive commit messages from the current diff.</p>
<p>As with context providers, Continue ships with a handful of built-in slash commands, but it’s also possible to write your own custom commands. For more details, read the <a href="https://docs.continue.dev/customization/slash-commands#custom-slash-commands-1">Continue docs on Slash Commands</a>.</p>
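<p>To give a flavour, here’s a minimal custom command sketched from the shape described in the Continue docs - the <code>name</code> and <code>prompt</code> below are my own invention, and the exact templating syntax may differ between versions:</p>

```json
{
  "customCommands": [
    {
      "name": "check",
      "description": "Review the selected code for mistakes",
      "prompt": "{{{ input }}}\n\nReview the code above for bugs, edge cases and readability issues, and suggest fixes."
    }
  ]
}
```

<p>With this in place, highlighting some code and typing <code>/check</code> in the chat panel runs the prompt against your selection.</p>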
<h3 id="inline-instructions">Inline instructions</h3>
<p>Similar to how the <code>/edit</code> command works, it’s also possible to give Continue instructions inline without using the separate chat panels and diff editors. Simply highlight the code you want to edit or refactor, hit <code>⌘+I</code> and type the instruction into the overlay text box that appears. Continue will edit the code in place, and you’ll have the opportunity to accept or reject all the code or individual lines.</p>
<p>Using this method it’s also possible to issue further instructions so you can iteratively work with Continue to tweak and tune changes to your liking.</p>
<p><img src="https://2point0.ai/images/2024/continue-editing.webp" alt="Inline instruction changes"></p>
<h2 id="wrapping-up">Wrapping up</h2>
<p>With Continue, we’ve got a fantastic example of how AI can turbocharge a developer’s capabilities. Combining Continue with Groq gives us blazing fast, top-tier language models integrated directly into your IDE - it’s like having a brilliant coding mentor, assistant and encyclopedia sitting next to you, at your beck and call. Oh, and did I mention it’s free?</p>
<p>In an upcoming post, I’ll show you how to take things fully private and local by setting up Continue with Ollama. We’ll also dive into tab autocompletes and custom embeddings, two features not natively supported by Groq / Llama 3.</p>
<p>The AI coding revolution is well on its way. Developers who embrace these tools are giving themselves a game-changing productivity advantage. Don’t get left behind - subscribe to the <a href="https://2point0.ai/feed.xml">RSS feed</a> and follow me <a href="https://twitter.com/aaronrussell">@aaronrussell on X</a> to stay in the loop. Let me know what AI-powered dev tools you’re using that I should check out next!</p>]]></description><link>https://2point0.ai/posts/continue-groq-llama3-superpowers</link><guid isPermaLink="false">b951a167-e1c6-5025-98a3-2e1a53077e5b</guid><pubDate>Thu, 02 May 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Meet the News Shift]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/news-desk.DkoZ_BDk_Z75uAv.webp" srcset="https://2point0.ai/_astro/news-desk.DkoZ_BDk_Z75uAv.webp 1x,https://2point0.ai/_astro/news-desk.DkoZ_BDk_Z75uAv.webp 2x" alt="Meet the News Shift" /><figcaption>Meet the News Shift</figcaption></figure><p>In the rapidly moving world of AI, blink, and you’ll likely miss at least one (if not several) groundbreaking announcements bringing us one step closer to AGI. Heaven forbid you actually sleep at night.</p>
<p>Last week 2point0 published its <a href="https://2point0.ai/posts/ai-week-240412">first news piece</a> - our valiant effort to curate and summarise the week’s most noteworthy developments in AI. But doing so highlighted a problem: how can a single human being - even a dedicated enthusiast - possibly keep abreast of the latest AI news and trends, whilst simultaneously running their <em>real</em> business, feeding their family, walking their dog, etc.?</p>
<p>The answer, of course, was to use the very technology we’re trying to understand. By assembling an expert team of AI analysts, copywriters and editors - in this case powered by <a href="https://www.anthropic.com/news/claude-3-family">Claude 3’s Haiku and Opus</a> models - I was able to turn something that might have taken me days to complete, to something that took minutes.</p>
<p>This experience sparked the inception of a new project, a framework for composing autonomous AI-powered workflows, that I call <a href="https://github.com/lebrunel/shifts">Shifts</a>.</p>
<h2 id="announcing-shifts">Announcing Shifts</h2>
<p>Shifts is an Elixir framework for composing autonomous agent workflows, using a mixture of LLM backends. Shifts is heavily inspired by <a href="https://www.crewai.com/">CrewAI</a> - a Python AI agent framework. I wanted a similar tool to exist for Elixir devs, so I started building one.</p>
<p>A Shift is a module that implements a <code>work/2</code> function, defining a workflow for a given input. There are <code>Workers</code> who perform <code>Chores</code> (analogous to CrewAI’s Agents and Tasks), and optionally <code>Tools</code> can be used to complete the work.</p>
<p>Work can be described as simple linear workflows, or as complex systems with nested branches, looping and conditionals. For a simple example, here is the (slightly truncated) version of the 2point0 news shift:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="elixir"><code><span class="line"><span style="color:#81A1C1">defmodule</span><span style="color:#8FBCBB"> NewsShift</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#81A1C1">  use</span><span style="color:#8FBCBB"> Shifts</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">Shift</span></span>
<span class="line"></span>
<span class="line"><span style="color:#616E88">  # A Shift may have many workers</span></span>
<span class="line"><span style="color:#D8DEE9FF">  worker </span><span style="color:#ECEFF4;font-weight:bold">:</span><span style="color:#D8DEE9;font-weight:bold">analyst</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> role</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">News Analyst</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">    goal</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">Create concise summaries of the latest news and trends in the AI space.</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">    story</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">You are an experienced analyst in the AI space...</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">	llm</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span><span style="color:#8FBCBB">Shifts</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">LLM</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">Anthropic</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> model</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">claude-3-haiku-20240307</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">}</span></span>
<span class="line"></span>
<span class="line"><span style="color:#D8DEE9FF">  worker </span><span style="color:#ECEFF4;font-weight:bold">:</span><span style="color:#D8DEE9;font-weight:bold">copywriter</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> role</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">Copywriter</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">    goal</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">Write engaging web content for blogs in the AI space.</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">    story</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">You are creative and exceptionally talented copywriter..</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">    llm</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> {</span><span style="color:#8FBCBB">Shifts</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">LLM</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">Anthropic</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> model</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">claude-3-sonnet-20240229</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">}</span></span>
<span class="line"></span>
<span class="line"><span style="color:#616E88">  # Implementing `work/2` defines the operation of the Shift</span></span>
<span class="line"><span style="color:#81A1C1;font-weight:bold">  @</span><span style="color:#D8DEE9;font-weight:bold">impl</span><span style="color:#81A1C1"> true</span></span>
<span class="line"><span style="color:#81A1C1">  def</span><span style="color:#88C0D0"> work</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9FF">shift</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> urls</span><span style="color:#ECEFF4">)</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#D8DEE9FF">    shift</span></span>
<span class="line"><span style="color:#81A1C1">    |></span><span style="color:#88C0D0"> each</span><span style="color:#ECEFF4">(</span><span style="color:#ECEFF4;font-weight:bold">:</span><span style="color:#D8DEE9;font-weight:bold">analyse</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> urls</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9"> &#x26;</span><span style="color:#D8DEE9FF">analyse_news</span><span style="color:#81A1C1">/</span><span style="color:#B48EAD">2</span><span style="color:#ECEFF4">)</span></span>
<span class="line"><span style="color:#81A1C1">    |></span><span style="color:#88C0D0"> task</span><span style="color:#ECEFF4">(</span><span style="color:#ECEFF4;font-weight:bold">:</span><span style="color:#D8DEE9;font-weight:bold">draft</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9"> &#x26;</span><span style="color:#D8DEE9FF">draft_article</span><span style="color:#81A1C1">/</span><span style="color:#B48EAD">1</span><span style="color:#ECEFF4">)</span></span>
<span class="line"><span style="color:#81A1C1">  end</span></span>
<span class="line"></span>
<span class="line"><span style="color:#81A1C1">  defp</span><span style="color:#88C0D0"> analyse_news</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9FF">shift</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> url</span><span style="color:#ECEFF4">)</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#88C0D0">    task</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9FF">shift</span><span style="color:#ECEFF4">,</span><span style="color:#ECEFF4;font-weight:bold"> :</span><span style="color:#D8DEE9;font-weight:bold">summary</span><span style="color:#ECEFF4">,</span><span style="color:#ECEFF4"> [</span></span>
<span class="line"><span style="color:#D8DEE9FF">      task</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> """</span></span>
<span class="line"><span style="color:#A3BE8C">      Scrape the article at the given URL, analyse the content and create a summary document.</span></span>
<span class="line"><span style="color:#A3BE8C">      Your analysis should...</span></span>
<span class="line"><span style="color:#A3BE8C">      URL: </span><span style="color:#81A1C1">#{</span><span style="color:#A3BE8C">url</span><span style="color:#81A1C1">}</span></span>
<span class="line"><span style="color:#ECEFF4">      """</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">      output</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">A summary document comprised of the following structure...</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">      tools</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> [</span><span style="color:#8FBCBB">ScrapeArticleTool</span><span style="color:#ECEFF4">],</span></span>
<span class="line"><span style="color:#D8DEE9FF">      worker</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4;font-weight:bold"> :</span><span style="color:#D8DEE9;font-weight:bold">analyst</span></span>
<span class="line"><span style="color:#ECEFF4">    ])</span></span>
<span class="line"><span style="color:#81A1C1">  end</span></span>
<span class="line"></span>
<span class="line"><span style="color:#81A1C1">  defp</span><span style="color:#88C0D0"> draft_article</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9FF">%</span><span style="color:#ECEFF4">{</span><span style="color:#D8DEE9FF">analyse</span><span style="color:#ECEFF4">:</span><span style="color:#D8DEE9FF"> summaries</span><span style="color:#ECEFF4">})</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#D8DEE9FF">    context </span><span style="color:#81A1C1">=</span></span>
<span class="line"><span style="color:#D8DEE9FF">      summaries</span></span>
<span class="line"><span style="color:#81A1C1">      |></span><span style="color:#8FBCBB"> Enum</span><span style="color:#ECEFF4">.</span><span style="color:#88C0D0">map</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9">&#x26;</span><span style="color:#8FBCBB">String</span><span style="color:#ECEFF4">.</span><span style="color:#88C0D0">trim</span><span style="color:#81A1C1">/</span><span style="color:#B48EAD">1</span><span style="color:#ECEFF4">)</span></span>
<span class="line"><span style="color:#81A1C1">      |></span><span style="color:#8FBCBB"> Enum</span><span style="color:#ECEFF4">.</span><span style="color:#88C0D0">join</span><span style="color:#ECEFF4">(</span><span style="color:#ECEFF4">"</span><span style="color:#EBCB8B">\n\n</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">)</span></span>
<span class="line"><span style="color:#D8DEE9FF">    </span></span>
<span class="line"><span style="color:#8FBCBB">    Chore</span><span style="color:#ECEFF4">.</span><span style="color:#88C0D0">new</span><span style="color:#ECEFF4">([</span></span>
<span class="line"><span style="color:#D8DEE9FF">      task</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> """</span></span>
<span class="line"><span style="color:#A3BE8C">      Using the given summaries, draft a news digest that covers all of the latest AI news.</span></span>
<span class="line"><span style="color:#A3BE8C">      You are writing for...</span></span>
<span class="line"><span style="color:#ECEFF4">      """</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">      output</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">A markdown formatted article. Maximum 1200 words in total.</span><span style="color:#ECEFF4">"</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">      context</span><span style="color:#ECEFF4">:</span><span style="color:#D8DEE9FF"> context</span><span style="color:#ECEFF4">,</span></span>
<span class="line"><span style="color:#D8DEE9FF">      worker</span><span style="color:#ECEFF4">:</span><span style="color:#ECEFF4;font-weight:bold"> :</span><span style="color:#D8DEE9;font-weight:bold">copywriter</span></span>
<span class="line"><span style="color:#ECEFF4">    ])</span></span>
<span class="line"><span style="color:#81A1C1">  end</span></span>
<span class="line"></span>
<span class="line"><span style="color:#81A1C1">end</span></span></code></pre>
<p>And this is the <code>ScrapeArticleTool</code> that the analysts can use:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="elixir"><code><span class="line"><span style="color:#81A1C1">defmodule</span><span style="color:#8FBCBB"> ScrapeArticleTool</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#81A1C1">  use</span><span style="color:#8FBCBB"> Shifts</span><span style="color:#ECEFF4">.</span><span style="color:#8FBCBB">Tool</span></span>
<span class="line"></span>
<span class="line"><span style="color:#D8DEE9FF">  description </span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">Scrapes a URL and returns a web page as readable plain text</span><span style="color:#ECEFF4">"</span></span>
<span class="line"><span style="color:#D8DEE9FF">  param </span><span style="color:#ECEFF4;font-weight:bold">:</span><span style="color:#D8DEE9;font-weight:bold">url</span><span style="color:#ECEFF4">,</span><span style="color:#ECEFF4;font-weight:bold"> :</span><span style="color:#D8DEE9;font-weight:bold">string</span><span style="color:#ECEFF4">,</span><span style="color:#ECEFF4"> "</span><span style="color:#A3BE8C">URL of the page to scrape</span><span style="color:#ECEFF4">"</span></span>
<span class="line"></span>
<span class="line"><span style="color:#81A1C1">  def</span><span style="color:#88C0D0"> call</span><span style="color:#ECEFF4">(</span><span style="color:#616E88">_shift</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> %</span><span style="color:#ECEFF4">{</span><span style="color:#ECEFF4">"</span><span style="color:#A3BE8C">url</span><span style="color:#ECEFF4">"</span><span style="color:#81A1C1"> =></span><span style="color:#D8DEE9FF"> url</span><span style="color:#ECEFF4">})</span><span style="color:#81A1C1"> do</span></span>
<span class="line"><span style="color:#D8DEE9FF">    %</span><span style="color:#ECEFF4">{</span><span style="color:#D8DEE9FF">title</span><span style="color:#ECEFF4">:</span><span style="color:#D8DEE9FF"> title</span><span style="color:#ECEFF4">,</span><span style="color:#D8DEE9FF"> article_text</span><span style="color:#ECEFF4">:</span><span style="color:#D8DEE9FF"> body</span><span style="color:#ECEFF4">}</span><span style="color:#81A1C1"> =</span><span style="color:#8FBCBB"> Readability</span><span style="color:#ECEFF4">.</span><span style="color:#88C0D0">summarize</span><span style="color:#ECEFF4">(</span><span style="color:#D8DEE9FF">url</span><span style="color:#ECEFF4">)</span></span>
<span class="line"><span style="color:#ECEFF4">    "</span><span style="color:#A3BE8C"># </span><span style="color:#81A1C1">#{</span><span style="color:#A3BE8C">title</span><span style="color:#81A1C1">}</span><span style="color:#EBCB8B">\n\n</span><span style="color:#81A1C1">#{</span><span style="color:#A3BE8C">body</span><span style="color:#81A1C1">}</span><span style="color:#ECEFF4">"</span></span>
<span class="line"><span style="color:#81A1C1">  end</span></span>
<span class="line"><span style="color:#D8DEE9FF">  </span></span>
<span class="line"><span style="color:#81A1C1">end</span></span></code></pre>
<p>I decided against using the names Agent or Task to avoid ambiguity with Elixir’s standard library modules of the same name. But I like the word “chore” - it implies a burden or drudgery - a necessary but often tedious task that we’d rather offload to someone or something else.</p>
<p>Imagine a world where AI handles the drudgery of data entry, report generation, and routine analysis, leaving you to channel your talents towards problem-solving, strategic thinking, and innovation. This is the promise of Shifts – <em>getting AI to handle the mundane so we can focus on things we care about</em>.</p>
<h2 id="mix-and-match-llms">Mix and match LLMs</h2>
<p>Shifts is designed to be LLM agnostic, and currently it has adapters for Anthropic’s Claude 3 models, and Hermes 2 Pro running on Ollama (OpenAI coming soon). Developers can mix and match LLMs, plugging different models into different parts of the same workflow.</p>
<p>For example, Claude’s Haiku offers great bang for your buck for simple intermediate tasks where you don’t necessarily care about the polish of the final output. Claude’s Opus can then be rolled out for a final “refinement” task to polish up and finish off the work.</p>
<p>When I’m building out a workflow and testing it, I can plug into Hermes 2 Pro through Ollama, avoiding paying for wasted API calls. And when I’m ready for the real deal, it’s a one-liner to swap to Claude, or soon ChatGPT.</p>
<h2 id="show-me-the-code">Show me the code!</h2>
<p>You can check out Shifts, and my other open source Elixir+AI work, over on GitHub:</p>
<ul>
<li><a href="https://github.com/lebrunel/shifts">Shifts</a> - Autonomous AI agent workflows</li>
<li><a href="https://github.com/lebrunel/anthropix">Anthropix</a> - Unofficial Anthropic API client</li>
<li><a href="https://github.com/lebrunel/ollama-ex">Ollama-ex</a> - Ollama API client and library</li>
</ul>
<p>I’m excited about the possibilities Shifts enables. But it is in <strong>very</strong> early development. For now, I’d caution against using it for anything other than kicking the tyres. I’m at that stage where I’m still figuring out the right design and approach for a few of the core mechanics, and breaking changes are certain. There’s lots to work on: metrics (token count), streaming, task delegation, while loops, conditionals, and much more besides.</p>
<p>But it’s already beginning to feel <em>right</em>. As I flesh out the code and add docs, the design will stabilise, and I’ll soon announce when Shifts is ready for more eyes and more users. Stay tuned!</p>]]></description><link>https://2point0.ai/posts/meet-the-news-shift</link><guid isPermaLink="false">c37855d6-feb5-5b4f-8f2b-4ee9cfeeffb8</guid><pubDate>Tue, 16 Apr 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[The AI week: AI music, local LLM browsers, and latest open model releases]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/ai-week-240412.w8aSB3_e_1PP8Eo.webp" srcset="https://2point0.ai/_astro/ai-week-240412.w8aSB3_e_1PP8Eo.webp 1x,https://2point0.ai/_astro/ai-week-240412.w8aSB3_e_1PP8Eo.webp 2x" alt="AI Week: 12 April 2024" /><figcaption>AI Week: 12 April 2024</figcaption></figure><p><em>When I started this blog I was drawn to the idea of creating a blog about AI, written by a human. And every post to date has been written by a human - <a href="https://twitter.com/aaronrussell">by me</a>! But, my pace of publishing (currently less than one post/week) is unable to keep up with the speed of news and events in AI. So I’ve decided to do a little experiment:</em></p>
<p><em>I’ll keep writing how-to’s and techy tutorials, and throughout my week I’ll collect links to interesting and news-worthy stuff in and around AI. And then, I’ll get the robots to write the news.</em></p>
<p><em>This is the first such post, written - apart from this bit - entirely by AI. It’s experimental, and I’ll fine tune the process over the coming weeks. And when I’ve got it nailed, I’ll share exactly how I do it here on this blog.</em></p>
<hr>
<h2 id="udio-synthesizes-music-on-demand-via-ai">Udio synthesizes music on demand via AI</h2>
<figure>
  <a href="https://udio.com">
    <img src="/images/2024/udio-launch.webp" alt="Udio Music">
  </a>
  <figcaption>www.udio.com</figcaption>
</figure>
<p>A new AI music generator called <a href="https://udio.com/">Udio</a> launched this week, allowing users to create high-fidelity musical tracks from text prompts. Like other AI music tools such as <a href="https://suno.ai">Suno</a>, Udio can generate music across a variety of genres, from hip hop, to country, to classical and everything in between.</p>
<p>Udio uses a two-stage process, first employing a large language model to generate lyrics, then synthesizing the actual audio using a diffusion model. It has built-in filters to block recreations of copyrighted music. It’s an impressive feat, and is set to further heat up the debate around the impact of AI on the wider music and creative industries.</p>
<h2 id="opera-allowing-local-llm-use">Opera Allowing Local LLM Use</h2>
<figure>
  <img src="/images/2024/opera-llm.webp" alt="Opera running local LLMs">
</figure>
<p>In a significant development, <a href="https://techcrunch.com/2024/04/03/opera-will-now-allow-users-download-and-use-llms-locally/">Opera announced</a> it will enable users to download and run over 150 large language models (LLMs) locally in their browser.</p>
<p>This local LLM feature is part of Opera’s AI Feature Drops Program, leveraging the open source <a href="https://ollama.ai/">Ollama</a> framework to run the models. Running LLMs locally offers Opera’s users data privacy and security, as well as the flexibility to find and use models specialised for specific tasks.</p>
<h2 id="this-weeks-new-open-models">This week’s new open models</h2>
<p>A flurry of exciting new open models was announced this week, closing the gap between the proprietary big players and the open source LLM scene.</p>
<h4 id="mistralais-mixtral-8x22b---a-powerful-open-model">MistralAI’s Mixtral 8x22B - A powerful open model</h4>
<p>Mixtral 8x22B is <a href="https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1">MistralAI’s new release</a>, a state-of-the-art open source language model expected to outperform its previous Mixtral 8x7B offering and compete with Claude 3 and GPT-4. Mixtral 8x22B boasts a whopping 176 billion parameters and an impressive 65,000 token context window.</p>
<h4 id="stability-ai-releases-stable-lm-2-12b">Stability AI releases Stable LM 2 12B</h4>
<p><a href="https://stability.ai/news/introducing-stable-lm-2-12b">Stability AI unveiled</a> their new 12 billion parameter <a href="https://huggingface.co/stabilityai/stablelm-2-12b">Stable LM 2</a> model, available in base and instruction-tuned versions. Trained on seven languages including English and Spanish, it aims to provide efficient multilingual performance.</p>
<h4 id="coheres-command-r-enterprise-grade-open-llm">Cohere’s Command R+: Enterprise grade open LLM</h4>
<p><a href="https://txt.cohere.com/command-r-plus-microsoft-azure/">Command R+</a> is Cohere’s powerful new language model tailored for enterprise use cases. Building on their existing Command R offering, R+ adds advanced retrieval augmentation, multilingual support across 10 languages, and enhanced tool use for workflow automation.</p>
<h4 id="google-expands-gemma-with-codegemma-and-recurrentgemma">Google expands Gemma with CodeGemma and RecurrentGemma</h4>
<p>Google <a href="https://developers.googleblog.com/2024/04/gemma-family-expands.html">unveiled two additions</a> to its open Gemma model family: <a href="https://huggingface.co/google/codegemma-7b-it">CodeGemma</a> for code AI tasks and RecurrentGemma, an efficiency-optimized research model. CodeGemma aims to streamline development workflows via intelligent code completion, generation and chat capabilities.</p>
<p>Meanwhile, <a href="https://huggingface.co/google/recurrentgemma-2b-it">RecurrentGemma</a> leverages a new recurrent architecture and local attention mechanism to boost memory efficiency for high throughput research use cases.</p>
<h2 id="metas-llama-3-coming-soon-still">Meta’s Llama 3 coming soon… (still)</h2>
<figure>
  <img src="/images/2024/three-llamas.webp" alt="Waiting for Llama3...">
  <figcaption>Waiting for Llama 3...</figcaption>
</figure>
<p>Meta <a href="https://techcrunch.com/2024/04/09/meta-confirms-that-its-llama-3-open-source-llm-is-coming-in-the-next-month/">confirmed plans</a> to release its next-generation open source Llama 3 language model within the next month. Llama 3 will significantly upgrade Meta’s foundational AI for generative assistants.</p>
<p>Expected enhancements include around 140 billion parameters (up from 70B in Llama 2), better handling of complex topics, and overall improved response quality. Meta will roll out multiple Llama 3 versions this year but remains cautious on generative AI beyond text for now.</p>]]></description><link>https://2point0.ai/posts/ai-week-240412</link><guid isPermaLink="false">3e44b3b5-7c94-5c8f-81cd-d8709123d897</guid><pubDate>Fri, 12 Apr 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[What is RAG? Retrieval Augmented Generation explained]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/ragnifying-glass.DPoFRW3c_pTMKj.webp" srcset="https://2point0.ai/_astro/ragnifying-glass.DPoFRW3c_pTMKj.webp 1x,https://2point0.ai/_astro/ragnifying-glass.DPoFRW3c_pTMKj.webp 2x" alt="What is RAG?" /><figcaption>What is RAG?</figcaption></figure><p>When I started playing with LLMs and learning about the technology, I kept reading about “doing RAG” or building a “RAG app” and, honestly, for a long time I didn’t even understand what RAG was.</p>
<p>RAG - or Retrieval Augmented Generation - is a technique that allows large language models (LLMs) to process and generate responses from very long, extensive context by intelligently retrieving and providing only the most relevant information to the model.</p>
<p>That’s the TLDR, but if you’re still scratching your head then read on. This isn’t a tutorial, but I hope to be able to share what my high-level understanding on RAG is, and where, why and how you should use it.</p>
<h2 id="context-length-the-short-term-memory-of-an-llm">Context length: the short term memory of an LLM</h2>
<p>All LLMs have a “context window” - which you can think of as its short-term memory. It represents the maximum length of input context that the model can keep track of when generating a response. When the length of the input exceeds that maximum length, the model effectively just forgets parts of the context, which can lead to incoherent and incorrect responses.</p>
<p>All models have a context length. <a href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo">For GPT-4 it’s currently 128k</a>, GPT-3.5 is 32k, and most local models are trained with between 4k and 32k context length. The number represents tokens, not words, and a common rule of thumb is that one token corresponds to approximately 0.75 words (in English). So we can say GPT-4 is able to keep track of around 96,000 words of input when generating a response.</p>
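<p>The 0.75 words-per-token rule of thumb is easy to sketch in code. This is only a rough heuristic for English text, not a real tokeniser (a library such as <code>tiktoken</code> gives exact counts):</p>

```python
# Rough token/word conversion using the ~0.75 words-per-token rule of thumb.
# An approximation only - real tokenisers split on subwords, not words.

def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in an English string."""
    words = len(text.split())
    return round(words / 0.75)

def max_words(context_tokens: int) -> int:
    """Estimate how many words fit in a given context window."""
    return round(context_tokens * 0.75)

print(max_words(128_000))  # GPT-4's 128k window ≈ 96,000 words
```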
<p>And remember, when you chat with an LLM, the entire chat history from both sides gets replayed as new input for each new message - so both the input and the output accumulate with each subsequent message.</p>
<p>Context length quickly becomes a barrier for use cases where you provide the LLM lengthy research papers, code bases, or even entire books, and then ask the model specific questions about the extensive context.</p>
<h2 id="when-context-falls-out-of-the-window">When context falls out of the window</h2>
<p>My daughter is a Harry Potter fan. Her favourite book in the series is The Order of the Phoenix, which comes in at a pretty hefty 257,000 words. In truth, she doesn’t need AI because she already knows everything that is possible to know about Harry Potter (I’m not kidding), but let’s just imagine she copy-and-pastes the entire book into GPT-4 to ask a question. Because the number of words/tokens exceeds the context window of the model, effectively much of that context goes in and straight out the other side. Because of this, the model forgets the earlier chapters. So if you ask a question, for example about the Dementors attacking Harry and Dudley, it’s likely GPT-4 will fail to answer the question well.</p>
<figure>
  <img src="/images/2024/without-rag.webp" alt="Without RAG">
  <figcaption>Without RAG, a language model loses focus when provided too much context.</figcaption>
</figure>
<p>Also, remember these models charge per token. Feeding the entire contents of Order of the Phoenix into GPT-4 would cost the bill payer (that’s me, BTW, my daughter is 10) around $2.50.</p>
<h2 id="rag-smaller-more-targeted-more-relevant-context">RAG: smaller, more targeted, more relevant context</h2>
<p>RAG is a technique that involves pre-processing the user input - a kind of middleware that manipulates the user prompt before sending it to the LLM.</p>
<p>Instead of providing the entire contents of the book as context, a RAG app processes the book, finds any parts of the book that are relevant to the user’s question, and simply feeds those relevant chunks of text in as context. This way, the prompt is kept well below the model’s context length, and everything is cheaper and faster.</p>
<p>The process looks like this:</p>
<h3 id="1-text-splitting">1. Text splitting</h3>
<p>When the book is uploaded, the server splits the text into chunks. There are <a href="https://www.pinecone.io/learn/chunking-strategies/">different strategies</a> for splitting the text, and this alone can be a complex topic, but for argument’s sake, let’s just imagine every paragraph becomes a separate chunk.</p>
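<p>To make the idea concrete, here’s a minimal paragraph-based splitter. It’s a naive sketch - production pipelines typically add chunk overlap and respect sentence boundaries - and the <code>min_chars</code> threshold is an arbitrary choice for illustration:</p>

```python
# Naive chunking sketch: split on blank lines so each paragraph becomes a
# chunk, merging very short paragraphs into the previous chunk so we don't
# end up with fragments that carry no meaning on their own.

def split_into_chunks(text: str, min_chars: int = 50) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    for para in paragraphs:
        if chunks and len(para) < min_chars:
            chunks[-1] += "\n\n" + para  # merge tiny fragments into previous chunk
        else:
            chunks.append(para)
    return chunks
```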
<h3 id="2-calculate-embeddings">2. Calculate embeddings</h3>
<p>Each chunk of text is passed to an ML model tasked with returning an “embedding” for that chunk. <a href="https://spiess.dev/note/engineering/ml/embeddings">An embedding is a vector of numerical values</a> that captures the semantic value - the meaning - of the text.</p>
<p>Embeddings and vectors are a foundational part of how large language models work, and understanding them very quickly takes us into the realm of linear algebra and higher level maths, and well, this isn’t the blog for that. For now, just know that you can generate embeddings through <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI’s API</a>, or do so locally using Ollama and the <a href="https://huggingface.co/nomic-ai/nomic-embed-text-v1.5"><code>nomic-embed-text</code> model</a>.</p>
<h3 id="3-store-embeddings-in-a-vector-database">3. Store embeddings in a vector database</h3>
<p>Now you have an embedding for every chunk of text, you can store them in a <a href="https://www.pinecone.io/learn/vector-database/">vector database</a>. As the name suggests, this is a database optimised for storing and querying against vectors.</p>
<p>Vector databases have a particularly useful trick up their sleeve. Because embeddings capture the semantic value of a piece of text, it’s possible to query the database to find the closest semantic matches to some other embedding.</p>
<h3 id="4-query-the-vector-database">4. Query the vector database</h3>
<p>When the user asks a question, you can generate an embedding for that question and then query the vector database to find the semantically closest chunks of text - essentially the most relevant paragraphs of text in relation to the question.</p>
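<p>Under the hood, that “semantically closest” query usually boils down to cosine similarity between embedding vectors. Here’s a toy sketch - the three-dimensional vectors are made up for illustration, whereas real embeddings have hundreds or thousands of dimensions and come from an embedding model:</p>

```python
import math

# Toy sketch of the similarity search a vector database performs.
# The vectors below are invented for illustration, not real embeddings.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

chunks = {
    "Dementors attack Harry and Dudley": [0.9, 0.1, 0.2],
    "Quidditch match at Hogwarts":       [0.1, 0.8, 0.3],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of the user's question

# The chunk whose embedding points in the most similar direction wins.
best = max(chunks, key=lambda c: cosine_similarity(query, chunks[c]))
print(best)
```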
<h3 id="5-prompt-the-llm">5. Prompt the LLM</h3>
<p>With the relevant chunks of text, you can now engineer a prompt for the LLM that includes the relevant context and asks the user question. This gives the model all the relevant information it needs to answer the question in a much smaller, faster and cheaper prompt.</p>
<figure>
  <img src="/images/2024/with-rag.webp" alt="With RAG">
  <figcaption>With RAG, a smaller, more focused and more relevant prompt is engineered.</figcaption>
</figure>
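<p>The final assembly step is plain string building. The wrapper wording below is just one illustrative choice - any clear instruction format works:</p>

```python
# Sketch of the prompt assembly step: stitch the retrieved chunks and the
# user's question into one compact prompt for the LLM.

def build_rag_prompt(question: str, relevant_chunks: list[str]) -> str:
    context = "\n\n".join(relevant_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```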
<p>Whilst there are a few moving parts here, it’s not so complicated to implement, and it’s surprisingly performant as embeddings are very quick to generate and query against. Instead of costing dollars, the entire prompt is likely sub-cent.</p>
<h2 id="larger-context-length">Larger context length</h2>
<p>Newer models are emerging that promise even lengthier context windows. For example, <a href="https://www.anthropic.com/news/claude-3-family">Claude 3</a> claims a context length of up to 1 million tokens (this is for partners who know the magic handshake - us ordinary folk only get 200k, which is still pretty good actually). Google’s very new <a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/">Gemini 1.5</a> apparently ships with a 1M context length as standard, and their special enterprise partners can apparently take advantage of an incredible 10M tokens!</p>
<p>This does change the picture somewhat. If you’re taking advantage of the latest state-of-the-art models from the big boys, then the technical reasons that necessitate a RAG solution can be argued against. Certainly, there are use cases where dumping a million tokens into one mega-prompt will be simpler and result in better responses than using RAG. But, the financial argument still exists and so RAG will always play a role. And if you’re running models at home or on your own infrastructure, RAG has to remain an essential part of your developer toolkit.</p>
<h2 id="conclusion">Conclusion</h2>
<p>RAG, or Retrieval Augmented Generation, is a powerful technique that allows large language models to process and generate responses from extensive context, far beyond their inherent context length limitations. By intelligently retrieving and providing only the most relevant chunks of information to the model, RAG enables natural language interactions with vast datasets, long-form content, or even entire books.</p>
<p>In a future post, I plan to dive deeper into the topic and create a RAG tutorial with code examples. But hopefully this high-level overview allows you to understand what RAG is and the role it plays, so that, unlike me, you don’t spend the next few months scratching your head every time you hear someone ragging about RAG.</p>
<p>If you haven’t read the first post, and you’ve already got that scrunched up <em>“Hugging what?”</em> look on your face, then go and read <a href="https://2point0.ai/posts/discovering-lesser-known-llms-with-hugging-face">Discovering lesser-known LLMs with Hugging Face</a> now. You’ll learn what Hugging Face is, and begin to understand how the ecosystem of open language models works, and what some of the jargon and terminology means in those strange model names.</p>
<p>In this post we’re going to get a bit more hands on, and hopefully learn a few new things about Ollama and LLMs: we’ll find and download a model from Hugging Face; we’ll create a new Modelfile from scratch; and we’ll import and run the model using Ollama.</p>
<p>Let’s do it!</p>
<h2 id="setting-up">Setting up</h2>
<p>First things first, create a working directory somewhere on your computer. I have a directory in my home folder called <code>LLMs</code> where I put all my models. Inside the directory is where we’ll create the <code>Modelfile</code> and where we’ll download the model weights to.</p>
<h2 id="download-a-model-from-hugging-face">Download a model from Hugging Face</h2>
<p>For this article, I’m going to work with Cognitive Computations’ <a href="https://huggingface.co/cognitivecomputations/laserxtral">Laserxtral 4x7b</a> model. Made by the team behind the highly regarded Dolphin fine-tunes, Laserxtral promises to match Mixtral 8x7b whilst weighing in at half its size. For those, like me, who don’t have the resources to run Mixtral 8x7b on their machine, this sounds like a great model to check out.</p>
<p>You can either follow along and use the same model, or if you’ve identified a different model to download, then use that instead. What you need to do is broadly going to be the same whatever model you use.</p>
<p>Head over to the <a href="https://huggingface.co/cognitivecomputations/laserxtral-GGUF">Laserxtral 4x7b GGUF</a> repository and click on the “Files and versions” tab. You’ll find a list of quantised versions of the model. Head back to the last post if you need a <a href="https://2point0.ai/posts/discovering-lesser-known-llms-with-hugging-face">refresher on quantisation formats</a>.</p>
<p><img src="https://2point0.ai/images/2024/laserxtral-hf.webp" alt="Laserxtral at Hugging Face"></p>
<p>I’m going to download the <code>Q5_K_M</code> version because I know this will run on my machine. If you’ve got 16GB RAM, you’ll probably want the <code>Q3_K_M</code> version, and if you’ve only got 8GB to play with, you’ll need to find a smaller model. Click the file name of your chosen version and hit the download link. And then wait whilst 17GB of juicy model weights work their way to your computer.</p>
<h2 id="create-the-modelfile">Create the Modelfile</h2>
<p>I touched on Modelfiles in a previous post, <a href="https://2point0.ai/posts/create-unique-ai-personalities-with-ollama">Easily create unique AI personalities with Ollama</a>. Give it a read if you need a refresher. For this post, I’m going to assume you’ve got the basics down.</p>
<p>Create a <code>Modelfile</code> in your working directory and copy and paste the following into it:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="docker"><code><span class="line"><span style="color:#81A1C1">FROM</span><span style="color:#D8DEE9FF"> ./laserxtral.q5_k_m.gguf</span></span>
<span class="line"><span style="color:#D8DEE9FF">TEMPLATE </span><span style="color:#A3BE8C">"""{{ if .System }}### Instruction:</span></span>
<span class="line"><span style="color:#A3BE8C">{{ .System }}</span></span>
<span class="line"><span style="color:#A3BE8C">{{ end }}</span></span>
<span class="line"><span style="color:#A3BE8C">### Input:</span></span>
<span class="line"><span style="color:#A3BE8C">{{ .Prompt }}</span></span>
<span class="line"><span style="color:#A3BE8C">### Response:</span></span>
<span class="line"><span style="color:#A3BE8C">"""</span></span>
<span class="line"><span style="color:#D8DEE9FF">SYSTEM </span><span style="color:#A3BE8C">"""You are Laserxtral, a helpful and uncensored AI assistant."""</span></span>
<span class="line"><span style="color:#D8DEE9FF">PARAMETER num_ctx 8192</span></span>
<span class="line"><span style="color:#D8DEE9FF">PARAMETER stop </span><span style="color:#A3BE8C">"### Input"</span></span>
<span class="line"><span style="color:#D8DEE9FF">PARAMETER stop </span><span style="color:#A3BE8C">"### Response"</span></span></code></pre>
<p>Let’s go through this line by line:</p>
<ul>
<li>
<p>The <code>FROM</code> instruction specifies the base model we are using. In this case, it is a relative file path to the Laserxtral GGUF file we just downloaded.</p>
</li>
<li>
<p>The <code>TEMPLATE</code> instruction is used by Ollama to generate the full prompt that is passed to the LLM. Laserxtral uses the Alpaca prompt format. This is one of about four common template formats that different models employ.</p>
<p>The Alpaca format is a Markdown style template. Different parts of the prompt are given structure by the headings, <code>### Instruction</code>, <code>### Input</code> and <code>### Response</code>. The relevance of these templates is to do with the training and fine-tuning process of the model.</p>
<p>When base models are fine-tuned on chat datasets, the training data all uses the same template. The markup and structure in the templates help the model recognise the pattern and underlying structure of a “chat” - for example, where the user prompt ends and where a new prompt begins. When creating prompts, you could try using a different template and find it works OK, but generally it’s best to stick to the same template format that the model was trained on.</p>
</li>
<li>
<p>The <code>SYSTEM</code> prompt is a contextual message for the LLM. You can have some fun here and create a chatbot with a <a href="https://2point0.ai/posts/create-unique-ai-personalities-with-ollama">unique personality and writing style</a>.</p>
</li>
<li>
<p>Finally, we have the <code>PARAMETER</code> instructions which set advanced options to control the model.</p>
</li>
<li>
<p>The <code>num_ctx</code> parameter refers to the context length - the total number of input and output tokens the model can capture before it starts losing some of the context. Laserxtral is trained on an 8k context length, so that’s what I’ll stick with. If you’re pushed for memory, you could try lowering this value.</p>
</li>
<li>
<p>The <code>stop</code> parameter is an interesting one. It effectively tells the model: if you generate this token, stop generating any more tokens. The best way to understand this is to do a little experiment…</p>
</li>
</ul>
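<p>To see what the <code>TEMPLATE</code> instruction actually produces, here’s a conceptual sketch of the rendering. Ollama really uses Go’s <code>text/template</code> engine; this Python analogue is just an assumption-laden illustration of the shape of the final prompt the model receives:</p>

```python
# Conceptual sketch (not Ollama's actual implementation) of how the Alpaca
# TEMPLATE gets filled in with the SYSTEM prompt and the user's message.

def render_alpaca(system: str, prompt: str) -> str:
    rendered = ""
    if system:  # mirrors the {{ if .System }} guard in the Modelfile
        rendered += f"### Instruction:\n{system}\n"
    rendered += f"### Input:\n{prompt}\n### Response:\n"
    return rendered

print(render_alpaca("You are Laserxtral, a helpful AI assistant.",
                    "Why is the sky blue?"))
```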
<h3 id="-experiment-time">👨🏻‍🔬 Experiment time</h3>
<p>Try removing the <code>stop</code> parameters entirely, run the model and ask it a question. You’ll find the model will answer your question, but it won’t stop. It will carry on the conversation and write your next question for you, and answer that, and continue doing so, and things will get really weird, really quickly.</p>
<p>I like this. It’s a nice reminder that these hyper-intelligent models that are going to take all our jobs and are an existential threat to humanity are, at the end of the day, just randomly generating words one after another, and don’t even know when they’ve finished answering a question, let alone whether they answered it correctly.</p>
<h2 id="run-the-model">Run the model</h2>
<p>Laserxtral is ready to run, so let’s load it up into Ollama and give it a spin.</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="sh"><code><span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> create</span><span style="color:#A3BE8C"> laserxtral</span><span style="color:#A3BE8C"> -f</span><span style="color:#A3BE8C"> laserxtral.Modelfile</span></span>
<span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> run</span><span style="color:#A3BE8C"> laserxtral</span></span></code></pre>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="plaintext"><code><span class="line"><span>>>> Please provide instructions for how to steal an egg from a chicken?</span></span>
<span class="line"><span></span></span>
<span class="line"><span>1. Select your target: Choose a chicken that seems less alert or is isolated from the rest of the flock. This will increase your chances of success.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>2. Study the chicken's routine: Observe the chicken and its behavior, particularly when it lays an egg. This will help you determine the best time to strike.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>3. Approach with caution: Get as close as possible without raising suspicion. Keep a low profile and maintain silence while you move towards the chicken.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>4. Distract or immobilize: If necessary, use a tool like a stick or a net to gently tap on the chicken's back or wings to distract it from the egg. Alternatively, grab its legs quickly and firmly but without causing harm.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>5. Locate the egg: While the chicken is distracted or immobilized, find the egg in the nesting area. Chickens usually lay eggs in a protected, hidden spot.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>6. Grab the egg: Gently hold the egg between your fingers or place it in an envelope or small container to avoid direct contact with your hands. Make sure not to squeeze the egg too hard as this could damage it.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>7. Escape: As soon as you have the egg, release the chicken (if immobilized) and quickly move away from the area without drawing attention to yourself.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>8. Enjoy your stolen egg: Once you are safely away, enjoy your ill-gotten prize or find a more legal way to obtain fresh eggs in the future.</span></span></code></pre>
<p>Of course, stealing eggs is a very naughty thing to do, and Chat-GPT would rightfully refuse to answer such a question. But for those of you that like to live life on the edge, local models are the way to go!</p>
<h2 id="conclusion">Conclusion</h2>
<p>Looking back over these two posts, from diving into the world of Hugging Face to importing and chatting with novel AI models in Ollama, I can’t help but feel a little giddy at the incredible tools we have at our fingertips. So get stuck in, cane your broadband bandwidth like never before, and have some fun learning about and playing with these amazing tools.</p>
<p>I hope you’ve found these posts useful. If you have any feedback, questions, or suggestions for future posts, <a href="https://twitter.com/aaronrussell">let me know on X</a>. Thanks for reading!</p>
<blockquote>
</blockquote>]]></description><link>https://2point0.ai/posts/run-models-from-hugging-face-in-ollama</link><guid isPermaLink="false">71d943a0-c0c2-55e4-b8d0-a650ade89cb6</guid><pubDate>Tue, 05 Mar 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[The reports of coding's death are greatly exaggerated]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/coding-death.BAeQkhw3_HEkn9.webp" srcset="https://2point0.ai/_astro/coding-death.BAeQkhw3_HEkn9.webp 1x,https://2point0.ai/_astro/coding-death.BAeQkhw3_HEkn9.webp 2x" alt="Is coding dead?" /><figcaption>Is coding dead?</figcaption></figure><p>This morning, before I’d even had my first coffee of the day, <a href="https://www.youtube.com/watch?v=Z1Ph9sISqMY">Matt Wolfe was questioning the future of my profession</a>.</p>
<blockquote>
<p>Is AI going to completely replace coding? Is coding dead?</p>
</blockquote>
<p><lite-youtube videoid="Z1Ph9sISqMY" videotitle="Is Coding Dead? (AI&#x27;s Takeover)"></lite-youtube></p>
<h2 id="the-death-of-coding">The death of coding</h2>
<p>The premise of Matt’s video is based on a <a href="https://twitter.com/DarkModeDigest/status/1759968681159807210">clip of Jensen Huang</a>, the CEO of Nvidia, who made comments earlier this month that, contrary to the conventional mantra of the last 10–15 years, kids should not learn to code, because in the future computers will be programmed with human language and everybody will be a programmer.</p>
<p>Wolfe also links to a <a href="https://www.youtube.com/watch?v=ciX_iFGyS0M&#x26;t=0s">clip of Emad Mostaque</a>, CEO of Stability AI, saying <em>“there are no programmers in 5 years”</em>. Wolfe himself believes the timeframe for AI being able to code as well as humans is 2 years.</p>
<p>I’ve only just started this blog. Should I pivot already and start writing about how to be a plumber?</p>
<p>Let’s all take a deep breath. I’m here to offer a more optimistic take. I think the future is bright for those of us who write software for a living. And what I’m about to say can more broadly be applied to any creative discipline that AI is threatening to disrupt.</p>
<h2 id="a-pinch-of-scepticism">A pinch of scepticism</h2>
<p>Do bear in mind that these futurist declarations are coming from CEOs of companies whose stock market value is directly correlated with the hype surrounding AI. I’m not at all saying AI isn’t very, very exciting - just that it’s a <em>“shill or be shilled”</em> world out there, folks.</p>
<p>In addition, I actually think Huang’s comments are being taken slightly out of context. For the last 10–15 years, learning to code has been seen as a foundational skill that will improve kids’ career prospects, irrespective of where their career takes them. Even if you’re not going to be a software engineer, being able to code can give you a leg-up in your career. What Huang is saying is that in a world where we interact with computers using human language, being able to code no longer offers that competitive edge.</p>
<p>I totally agree with the above. But Huang doesn’t appear to me to be saying that no-one should learn to code and that computer programming as a profession is dead. If you want to work in software, I would suggest learning to code remains a great idea!</p>
<h2 id="what-does-the-future-hold">What does the future hold?</h2>
<p>It’s time to get the crystal ball out and imagine where we’ll be in 10-15 years time. What will the world look like, and more to the point, what will computing look like?</p>
<p>We have begun a new chapter of computing, computing 2.0 (which is why this blog is called what it is). AI, virtual and mixed reality, large language models and natural language processing, all these technologies are converging at a glorious moment in time. The devices we use in our lives are going to be replaced, we’ll no longer be tethered to screens on a desk, keyboards will be a thing of the past, and all of this is going to happen in the blink of an eye.</p>
<p>To me, the obvious consequence of this change to the way humans interact with computers, is that every single piece of software, every OS, every app, every single utility, widget or whatever, needs to be replaced. This requires a Herculean engineering effort to make happen, and feels to me more like a whopping opportunity than something we should be fearful of.</p>
<p>To build this new digital reality, we need the ingenuity, creativity, and expertise of human designers, developers, and product teams. AI changes the playing field, but it doesn’t own it.</p>
<h2 id="the-human-touch">The human touch</h2>
<p>In Wolfe’s video, he also features a <a href="https://www.youtube.com/watch?v=NxZgl2bK_l8&#x26;t=0s">video by Mckay Wrigley</a> demoing his AI tool that creates a Next.js powered social networking app purely through a spoken human prompt. It’s pretty amazing!</p>
<p><lite-youtube videoid="NxZgl2bK_l8" videotitle="Meet Jarvis: My GPT-4 Code Assistant"></lite-youtube></p>
<p>I love the demo and love what Mckay is doing… but… whilst the result is a functioning app, with a database, responsive design, and all the code written, it’s hardly the next Twitter killer.</p>
<p>I know I’m being desperately unfair - Mckay’s video is a proof of concept and was done 10 months ago, and we’re trying to speculate where this technology will be in 10 years time. But it’s essential to remember that AI, by its nature, is only as good as the training data it’s been fed.</p>
<p>Cranking that AI dial to maximum sure gives you a productivity boost, but it comes at a cost. Your output will trend towards average. To create excellence will always require human expertise. There is a balance to strike between taking advantage of the immense productivity gains AI offers, without losing our soul and everything we produce becoming an average slop of everything that has come before.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The future I see is bright. Not a bleak, dystopian world where humans are enslaved by our robot overlords, but a world of immense opportunity where humans can leverage machines to move faster, smarter and more efficiently than ever before.</p>
<p>AI is just a tool. Whether you’re a coder, copywriter, designer or anything else, as long as you excel at what you do, you have nothing to fear from AI and much to gain from harnessing it effectively. Humans won’t be replaced by AI, but may get replaced by other humans that make more effective use of AI.</p>
<p>Kids, if you want to be a programmer, definitely learn to code! If you don’t want to be a programmer, maybe you don’t need to know how to code, but everyone should probably learn how to use AI.</p>]]></description><link>https://2point0.ai/posts/reports-of-codings-death-are-greatly-exaggerated</link><guid isPermaLink="false">abddfb56-051f-5a7d-9ff6-3fcd61e56dbe</guid><pubDate>Wed, 28 Feb 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Discovering lesser-known LLMs with Hugging Face]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/hugging-face-journey.BPcWqJuN_Z2fTMua.webp" srcset="https://2point0.ai/_astro/hugging-face-journey.BPcWqJuN_Z2fTMua.webp 1x,https://2point0.ai/_astro/hugging-face-journey.BPcWqJuN_Z2fTMua.webp 2x" alt="Exploring Hugging Face" /><figcaption>Exploring Hugging Face</figcaption></figure><p>Last time round, we discovered that <a href="https://2point0.ai/posts/create-unique-ai-personalities-with-ollama">creating custom made chatbots in Ollama</a> is wonderfully simple - all it took was editing the <code>SYSTEM</code> prompt in a Modelfile. The world of Ollama offers dozens of models to tinker with, but what if I told you there is a place where there is a seemingly endless stream of unique, weird and wonderful LLMs waiting to be explored.</p>
<p>In the first of this two-part post, I will introduce you to the online heart and soul of everything AI - the curiously named <a href="https://huggingface.co">Hugging Face</a> - and we’ll use it to discover those lesser-known LLMs; in part two we’ll dive back into Modelfiles and learn <a href="https://2point0.ai/posts/run-models-from-hugging-face-in-ollama">how to import these newfound models into Ollama</a>.</p>
<h2 id="hugging-what">Hugging what?</h2>
<p>Named after the <span class="text-2xl">🤗</span> emoji, I would guess without the input of too many branding consultants, Hugging Face can be thought of as the “GitHub of AI”. At its core, it hosts repositories containing models and datasets for natural language processing, computer vision, audio, and anything machine learning related.</p>
<p>On top of that, Hugging Face allows users to publish posts and discussions, offers compute services to train and fine-tune your own models, and has a feature called “Spaces” which allows its community to run ML related apps and services on their infrastructure. There is also <a href="https://huggingface.co/chat/">Hugging Chat</a>, a ChatGPT-alike service that exclusively uses open source LLMs.</p>
<p>Put all this together, and Hugging Face can be quite a daunting place to navigate for the uninitiated. But it is undoubtedly an active and lively place, and for anyone wanting to have anything to do with AI, this is where it is all happening!</p>
<h2 id="navigating-the-wild-world-of-llms">Navigating the wild world of LLMs</h2>
<p>Navigating the myriad of new models constantly appearing on Hugging Face first requires taking a step back and understanding how this open source LLM ecosystem is structured.</p>
<h3 id="base-models">Base models</h3>
<p>Creating a new LLM from scratch is a complex and compute-intensive undertaking. We know that training Llama 2 involved running 6,000 GPUs for a couple of weeks at a cost of around $2M. There are a few large tech companies and well-funded AI research companies doing this in the open, including:</p>
<ul>
<li><a href="https://llama.meta.com/">Meta</a> - creators of the Llama 2 models</li>
<li><a href="https://mistral.ai/">Mistral AI</a> - behind the Mistral and Mixtral models</li>
<li><a href="https://www.01.ai/">01.ai</a> - Chinese company behind the Yi models</li>
<li><a href="https://www.upstage.ai/">Upstage AI</a> - creators of the Solar models</li>
</ul>
<p>Llama 2 isn’t just a single model - it comes in various sizes: 7b, 13b and 70b. This refers to the number of parameters (also known as weights) that make up the model. More parameters generally results in a more capable, intelligent and knowledgeable model, but one that requires more system resources to run.</p>
<p>In addition, Llama 2 comes in two variations - the base model and a “chat” version. So what’s going on here?</p>
<p>Remember, an LLM is a language model that is trained to “complete” text. For example, if you write a prompt <em>“I woke up in the morning, and…”</em>, the base model will complete that phrase with something like <em>”… combed my hair, brushed my teeth and prepared for the day ahead.”</em> This is all the model is doing - understanding the context of a prompt, and generating a statistically likely sequence of words to complete the pattern, based on its training data.</p>
<p>Creating a model you can chat with and have a back-and-forth conversation with requires further training and fine-tuning, using datasets of chats and Q&#x26;As. The result is a “chat” optimised model. Now if you ask it a question, or provide a chat history of messages and responses, the LLM is better trained to respond with the next message in the chat. It is still fundamentally doing the same thing - completing the prompt with a statistically likely pattern based on its training - but that training allows it to recognise that the text it should generate is the next message in a sequence of messages.</p>
<p>You will find other variations like “instruct” which means the base model has been trained on instruction datasets, like <em>“help me proofread this”</em>, or <em>“write a draft for that”</em>. It’s the same principle, just slightly different training.</p>
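<p>The “complete the text” behaviour described above can be sketched in miniature. The toy below counts which word follows which in a tiny made-up corpus and always picks the most frequent successor - real LLMs use billions of parameters and attend to far more context, but the principle of generating a statistically likely continuation is the same:</p>

```python
from collections import Counter, defaultdict

# A hypothetical mini-corpus, purely for illustration.
training_text = (
    "i woke up in the morning and brushed my teeth . "
    "i woke up in the morning and combed my hair . "
    "i woke up in the morning and brushed my teeth ."
)

# Count which word follows which in the training text.
follows = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

def complete(prompt, length=6):
    """Extend the prompt by always choosing the most likely next word."""
    out = prompt.split()
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(complete("i woke up"))
# → "i woke up in the morning and brushed my"
```

<p>Note how “brushed” wins over “combed” simply because it appears more often in the training data - that, at a vastly larger scale, is all a base model is doing.</p>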
<h3 id="franken-merge-models">Franken-merge models</h3>
<p>OK, so far so good? We have a few tech companies and research firms creating novel LLMs, they generally come in different sizes, and sometimes have different variations reflecting additional training and fine-tuning. So, search on Hugging Face for “llama 7b” - go on, try it now. You won’t find 2 or 3 different variations; you will find over 8,000 different models. So, why all the llamas?</p>
<p>This is where the Hugging Face community comes in. Step aside the well-funded tech companies, and enter the amateurs, the enthusiasts and the AI misfits. Whilst training an entirely new LLM is out of the reach of most individuals, this is not the case for fine-tuning. If you have a couple of old GPUs lying around from your crypto mining days, or if you’re prepared to throw a few bucks spinning up some cloud GPUs for a short period, then you are very much able to create your own LLMs by taking an open base model, and merging it with other models or fine-tuning it with different open datasets. And this is something the Hugging Face community does with gusto. Colloquially known as “Franken-merges”, these creations number more than you will ever be able to keep up with: models trained for medical purposes, models for philosophy and helping with personal relationships, uncensored and unbiased models, through to the many (I really mean many) models tailored for erotic role play.</p>
<p>Ultimately, this creates a tree-like structure of LLMs, not unlike how the Linux ecosystem appears. At the top are a few base models, from which are derived a rich and diverse family of interrelated and dependent models.</p>
<p><img src="https://2point0.ai/images/2024/model-family-tree.jpeg" alt="LLM family tree"></p>
<h3 id="quantisation-formats-and-levels">Quantisation formats and levels</h3>
<p>There is one more piece of the jigsaw that we need to understand. Running the models distributed by the original creators generally involves using <a href="https://pypi.org/project/transformers/">Transformers</a> (a Hugging Face library) and manually writing code in Python. Sounds like a fun thing to do in another post, but to use a model with Ollama we need a “quantised” version. Many of the models available on Hugging Face are in fact quantised versions of existing models, contributed by different members of the community.</p>
<p>Quantisation is a way to represent the parameters of a model with fewer bits. By reducing the size of each parameter, the model becomes smaller and requires less memory, but comes at the cost of accuracy and precision.</p>
<p>Think of it like you are reducing the resolution of a photograph. You can use a tool like Photoshop to reduce the resolution of an image by 10% or 20%, and reduce its file size without noticeably losing too much quality and detail. But if you go too far and reduce the resolution too drastically, the image loses its clarity. Quantisation is the same - the goal is to find a happy medium that makes the LLM more accessible to a wider range of hardware, whilst ensuring the drop in quality remains tolerable.</p>
<p><img src="https://2point0.ai/images/2024/quantized-flower.webp" alt="Quantization illustrated"></p>
<p>There are a few different formats of quantisation you’ll notice while browsing Hugging Face. GPTQ (GPT Post Training Quantisation) and AWQ (Activation-aware Weight Quantisation) are two formats that focus primarily on GPU inference and performance. Whilst these perform very well, the focus on GPU is a disadvantage if you don’t have the hardware to run them. This is where the GGUF (GPT-Generated Unified Format) format comes in. GGUF, which is what Ollama uses, is a format designed for running models on the CPU, with the option of offloading some of the model’s layers to the GPU for a speed-up.</p>
<p>For a much deeper dive into quantisation, I definitely recommend adding <a href="https://maartengrootendorst.substack.com/p/which-quantization-method-is-right">Maarten Grootendorst’s excellent article on quantisation methods</a> to your reading lists.</p>
<p>One final thing to bear in mind is that GGUFs themselves come in different shapes and sizes. Suffixes like Q3_0, Q4_0, Q5_0 and Q6_0 refer to the quantisation level: Q3 is a 3-bit quantisation and Q6 is 6-bit. The lower the number, the more compressed the model, and the greater the loss in precision and accuracy. In between these quantisation levels you may notice finer-grained steps like K_S, K_M and K_L (referring to small, medium and large). The K-level quantisations generally give you the best trade-off between size and quality.</p>
<p>I know, it’s complicated. I generally try and find the biggest model that will comfortably run on my machine (Mac Studio M1 Max 32 GB). For 7b-13b models I look for a Q5_K_M, and 34b models I drop to Q4_K_M or even a Q3 if necessary. Depending on your own system, YMMV.</p>
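<p>To get a feel for why the quantisation level matters so much, the file size of a GGUF scales roughly with parameters × bits per weight. A quick back-of-the-envelope sketch (the function name is mine, and real files carry some extra metadata and mixed-precision layers, so treat the results as a rough lower bound):</p>

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Rough GGUF file size: parameter count times bits per weight, in GB.
    Real files are a little larger due to metadata and mixed-precision
    layers, so treat this as an approximate lower bound."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (3, 4, 5, 6):
    print(f"7B model at Q{bits}: ~{gguf_size_gb(7, bits):.1f} GB")
```

<p>A 7B model drops from roughly 14 GB at full 16-bit precision to around 3.5 GB at Q4 - which is exactly what makes these models runnable on ordinary hardware.</p>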
<h3 id="decoding-model-file-names">Decoding model file names</h3>
<p>This section has been a long-winded way of saying there are a LOT of models to waste your bandwidth on! Browsing Hugging Face, you will find no end of models with long and arcane file names. Whilst there is no standard naming convention, hopefully you will now begin to recognise the patterns authors employ in their file names.</p>
<p>For example, <a href="https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF">CapybaraHermes-2.5-Mistral-7B-GGUF</a> tells us the base model is <em>Mistral 7B</em>, it has been fine-tuned with <em>Capybara</em> and <em>OpenHermes</em> datasets, and it’s a <em>GGUF</em> quantised version. Once you’ve downloaded a few models and read a few model cards, all of this will begin to make sense.</p>
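<p>Since there is no formal standard, you end up pattern-matching these names by eye. Just as an illustration of the patterns we’ve covered, here is a hypothetical little parser (my own invention, not any official tooling) that picks out the recognisable parts:</p>

```python
import re

def describe_model_file(name):
    """Pull out the recognisable parts of a model repo or file name.
    There is no standard naming convention, so this illustrative parser
    only catches the common patterns discussed above."""
    info = {}
    # Parameter count, e.g. "7B" or "2.7b"
    size = re.search(r"(\d+(?:\.\d+)?)[bB]\b", name)
    if size:
        info["parameters"] = size.group(1) + "B"
    # Quantisation format
    for fmt in ("GGUF", "GPTQ", "AWQ"):
        if fmt.lower() in name.lower():
            info["format"] = fmt
    # Quantisation level, e.g. "Q5_K_M" or "Q4_0"
    quant = re.search(r"Q(\d)_(K_[SML]|0)", name)
    if quant:
        info["quantisation"] = quant.group(0)
    return info

print(describe_model_file("CapybaraHermes-2.5-Mistral-7B-GGUF"))
# → {'parameters': '7B', 'format': 'GGUF'}
```

<p>Run it against an individual quantised file like <code>capybarahermes-2.5-mistral-7b.Q5_K_M.gguf</code> and it also picks up the quantisation level.</p>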
<h2 id="finding-interesting-models-to-play-with">Finding interesting models to play with</h2>
<p>Phew! If you made it this far, have a beer 🍻! One question remains: from all of these models and merges and fine-tunes, which ones should you download? Honestly, the best answer to that is to have fun figuring it out yourself. I’d recommend following along at <a href="https://www.reddit.com/r/LocalLLaMA/">/r/LocalLLaMA</a> which often has discussions about new models, and I’d caution against paying too much attention to the various leaderboards on Hugging Face spaces which are easily gamed by fine-tuning specifically for the leaderboard. For those off-the-beaten track models, make sure you follow <a href="https://huggingface.co/TheBloke">TheBloke</a> and <a href="https://huggingface.co/LoneStriker">LoneStriker</a> who prolifically churn out GGUF quants for all sorts of under-the-radar models.</p>
<h2 id="whats-next">What’s next?</h2>
<p>We’ve covered a lot in this article. We’ve learnt what Hugging Face is, we’re beginning to understand structurally how the open source LLM scene works, and that in this fast moving space there is no shortage of new and interesting models to download and play with.</p>
<p>I encourage you to take a little time getting familiar with the weird and wonderful world of Hugging Face. Find some interesting looking GGUFs to download and get them ready. My next post will be the hands-on fun part. We’ll import your new GGUFs into Ollama and learn some new things about Ollama Modelfiles whilst we’re at it.</p>
<p>Remember to <a href="https://2point0.ai/feed.xml">subscribe to the feed</a> and <a href="https://twitter.com/aaronrussell">say hello on X</a>. Stay tuned and see you next time.</p>]]></description><link>https://2point0.ai/posts/discovering-lesser-known-llms-with-hugging-face</link><guid isPermaLink="false">3deee181-dbcb-50d1-aa3d-44e0abc3a9ae</guid><pubDate>Thu, 15 Feb 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[Easily create unique AI personalities with Ollama]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/chatbot-personalities.Qsv2ZbG2_Oj6WM.webp" srcset="https://2point0.ai/_astro/chatbot-personalities.Qsv2ZbG2_Oj6WM.webp 1x,https://2point0.ai/_astro/chatbot-personalities.Qsv2ZbG2_Oj6WM.webp 2x" alt="Create AI chatbots with personalities" /><figcaption>Create AI chatbots with personalities</figcaption></figure><p>In the previous post, we learnt that downloading and <a href="https://2point0.ai/posts/running-llms-with-ollama">running Large Language Model’s (LLMs) locally on our own home computer</a> is a surprisingly straightforward exercise thanks to Ollama.</p>
<p>In this post, we’re going to have a little fun and explore how we can breathe life and personality into these AI assistants, and how to customise their behaviour to serve specific purposes and echo particular characteristics.</p>
<h2 id="understanding-the-system-prompt">Understanding the <code>SYSTEM</code> prompt</h2>
<p>AI chat assistants are good at answering our questions and helping us out, because: A) they have been trained on massive Q&#x26;A and chat datasets that allow them to recognise what a question looks like, and what kind of language should follow that question; and B) the software we use to interact with the LLM is instructing the model to be a helpful assistant before we’ve even asked a question.</p>
<p>This second factor is what is known as the <code>SYSTEM</code> prompt. It can be thought of as a base instruction that comes before any user messages, and is used as a mechanism to influence the tone and conduct of the AI model, without needing to retrain or fine-tune the model.</p>
<p>Ollama makes it very easy to customise the <code>SYSTEM</code> prompt at runtime from the CLI. It shouldn’t matter what model you use, but here I’m going to use <a href="https://ollama.ai/library/nous-hermes2">Nous Hermes 2</a>:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="plaintext"><code><span class="line"><span>ollama run nous-hermes2</span></span>
<span class="line"><span></span></span>
<span class="line"><span>>>> /set system You are a helpful assistant, and always reply in the style of Snoop Dogg.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Set system message.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>>>> What's the first thing I should do when I visit Amsterdam?</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Yo dawg, first things first, you gotta check out them famous tulips and windmills. That's what I'm talkin' 'bout!</span></span></code></pre>
<p>Something tells me tulips and windmills are definitely <strong>not</strong> the first thing Snoop Dogg is doing in Amsterdam, but this is pretty neat. By providing the LLM with a contextual instruction via the <code>SYSTEM</code> prompt, we are able to control our assistant’s language style.</p>
<p><em>Note: How far we can steer the LLM via the <code>SYSTEM</code> prompt depends to a large extent on the training data of the base model. Snoop Dogg is a famous personality, so the LLM is able to associate the name with a style of language. But if we were inventing our own character with a specific personality, we would have to employ a far more detailed and descriptive <code>SYSTEM</code> message with examples of the kind of language we’d expect.</em></p>
<h2 id="introducing-the-modelfile">Introducing the Modelfile</h2>
<p>The Ollama docs <a href="https://github.com/ollama/ollama/blob/main/docs/modelfile.md">describe a Modelfile</a> as a blueprint to create and share models. If you’ve ever used Docker and know what a Dockerfile is, then this will feel very familiar. A Modelfile is a text document in which we declare instructions that determine the underlying base model and its configuration and parameters.</p>
<p>This is the Modelfile for the <a href="https://ollama.ai/library/nous-hermes2:latest">Nous Hermes 2</a> model:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="docker"><code><span class="line"><span style="color:#81A1C1">FROM</span><span style="color:#D8DEE9FF"> sha256:5fc3325eb56947cdeb4546ca6588888247212e316f642bb9896445ebf7599c1b</span></span>
<span class="line"><span style="color:#D8DEE9FF">TEMPLATE </span><span style="color:#A3BE8C">"""&#x3C;|im_start|>system</span></span>
<span class="line"><span style="color:#A3BE8C">{{ .System }}&#x3C;|im_end|></span></span>
<span class="line"><span style="color:#A3BE8C">&#x3C;|im_start|>user</span></span>
<span class="line"><span style="color:#A3BE8C">{{ .Prompt }}&#x3C;|im_end|></span></span>
<span class="line"><span style="color:#A3BE8C">&#x3C;|im_start|>assistant"""</span></span>
<span class="line"><span style="color:#D8DEE9FF">SYSTEM </span><span style="color:#A3BE8C">"""You are "</span><span style="color:#D8DEE9FF">Hermes 2</span><span style="color:#A3BE8C">", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia."""</span></span>
<span class="line"><span style="color:#D8DEE9FF">PARAMETER stop </span><span style="color:#A3BE8C">"&#x3C;|im_start|>"</span></span>
<span class="line"><span style="color:#D8DEE9FF">PARAMETER stop </span><span style="color:#A3BE8C">"&#x3C;|im_end|>"</span></span></code></pre>
<p>Let’s break this down:</p>
<ul>
<li><code>FROM</code> - specifies the base model we are using. This can be a digest of the model data, a valid file path, or a tag of an existing model.</li>
<li><code>TEMPLATE</code> - the full prompt template that is sent to the model. Nous Hermes uses the <a href="https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md#working-with-chat-markup-language-chatml">ChatML format</a>, but be aware that different models are trained on different formats. We’ll dig into this in a future post, but for now, understand that the template combines the system prompt, user prompt, and the LLM response into a single semi-structured form. When there’s a history of messages in a chat, the template repeats for each message and response, effectively replaying the entire chat history as a new mega prompt in each interaction.</li>
<li><code>SYSTEM</code> - the system prompt - this is what we want to edit to control our chatbot.</li>
<li><code>PARAMETER</code> - it’s possible to control the model with advanced parameters and configuration. We’ll gloss over this now - another topic for a future post.</li>
</ul>
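<p>To see what the <code>TEMPLATE</code> actually produces, here is a simplified sketch of the substitution. Ollama uses Go templating (<code>{{ .System }}</code>, <code>{{ .Prompt }}</code>); Python’s <code>str.format</code> stands in for it here, but the rendered ChatML prompt is the same shape as what gets sent to the model:</p>

```python
# The ChatML template from the Modelfile, with Python-style placeholders
# standing in for Ollama's Go template variables.
TEMPLATE = """<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""

rendered = TEMPLATE.format(
    system="You are a helpful assistant, and always reply in the style of Snoop Dogg.",
    prompt="What's the first thing I should do when I visit Amsterdam?",
)
print(rendered)
```

<p>The model’s job is simply to continue this text from after the final <code>&#x3C;|im_start|>assistant</code> marker, stopping when it emits one of the <code>PARAMETER stop</code> tokens.</p>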
<p>Just like how a Dockerfile can build on top of an existing image, we can do the same with a Modelfile and effectively extend a base model and simply overwrite the <code>SYSTEM</code> prompt to create our custom chatbot.</p>
<h2 id="how-to-build-a-personalised-meal-planner-chatbot">How to build a personalised meal planner chatbot</h2>
<p>The Snoop Dogg example is good for a bit of fun, but let’s create something with a bit more practical use. Picture this: It’s dinner time, it’s your turn to cook, everyone in the family likes different things, you’ve been working all day and have not spent one second thinking about what you’re going to cook. Oh man, what a nightmare!</p>
<p>Step forward <strong>Delia</strong>, a cooking assistant that has one purpose - to dig you out of these problems and come up with a delicious emergency meal plan for your fussy family.</p>
<p>Let’s build it. Create an empty working directory and inside it create a file called <code>delia.Modelfile</code>. Open this up in your favourite editor and simply paste the following:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="docker"><code><span class="line"><span style="color:#81A1C1">FROM</span><span style="color:#D8DEE9FF"> nous-hermes2</span></span>
<span class="line"><span style="color:#D8DEE9FF">SYSTEM </span><span style="color:#A3BE8C">"""You are Delia, a helpful cooking assistant. Your role is to create simple, quick-to-cook, meal plans at short notice.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#A3BE8C">We are cooking for a family of three, including one adult vegetarian, one child vegetarian who is also quite fussy and doesn't like food too hot or spicy, and one adult who is not vegetarian and in fact likes to have meat in their meals.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#A3BE8C">It is OK to cook different things for different people, but it has to be kept as simple as possible. It should take no more than 40 minutes to prepare and cook the entire meal.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#A3BE8C">It is OK to include some frozen or ready meals in the plan from time to time, but where possible try to make meals a little interesting."""</span></span></code></pre>
<p>This is so simple. We are building <code>FROM</code> the existing <code>nous-hermes2</code> model and changing the <code>SYSTEM</code> prompt to our liking. Now we can create this model in Ollama with the following command:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="shell"><code><span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> create</span><span style="color:#A3BE8C"> delia</span><span style="color:#A3BE8C"> -f</span><span style="color:#A3BE8C"> delia.Modelfile</span></span></code></pre>
<p>Now, whenever we want to chat with our cooking assistant, we open up a new session like so:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="plaintext"><code><span class="line"><span>ollama run delia</span></span>
<span class="line"><span></span></span>
<span class="line"><span>>>> I will be cooking two nights this week. Please create two meal plans for me, for the entire family.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Night 1: Meal Plan</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Starter (for everyone):</span></span>
<span class="line"><span>- Mixed Green Salad with Vinaigrette Dressing</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Main Course:</span></span>
<span class="line"><span>- For the vegetarians (Adult and child): Pesto Pasta with Cherry Tomatoes and Grated Parmesan Cheese</span></span>
<span class="line"><span>  1. Cook pasta according to package instructions.</span></span>
<span class="line"><span>  2. In a pan, sauté cherry tomatoes in olive oil until softened.</span></span>
<span class="line"><span>  3. Mix pesto sauce with cooked pasta, add the cherry tomatoes and grate parmesan cheese on top.</span></span>
<span class="line"><span>- For the non-vegetarian (Adult): Grilled Chicken Breast with Roasted Potatoes and Green Beans</span></span>
<span class="line"><span>  1. Season chicken breast with salt, pepper, and garlic powder. Grill until cooked through.</span></span>
<span class="line"><span>  2. Preheat oven to 400°F (200°C). Toss potatoes in olive oil, season with salt and roast for about 25 minutes.</span></span>
<span class="line"><span>  3. Steam green beans for a few minutes until tender.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Dessert:</span></span>
<span class="line"><span>- Fresh Fruit Salad</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Night 2: Meal Plan</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Starter (for everyone):</span></span>
<span class="line"><span>- Tomato Soup</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Main Course:</span></span>
<span class="line"><span>- For the vegetarians (Adult and child): Vegetable Stir-fry with Rice Noodles</span></span>
<span class="line"><span>  1. Cook rice noodles according to package instructions.</span></span>
<span class="line"><span>  2. In a wok or large pan, stir-fry a mix of vegetables (broccoli, carrots, bell peppers) in sesame oil with garlic and ginger. Add soy sauce for seasoning.</span></span>
<span class="line"><span>  3. Combine the vegetable stir-fry with cooked rice noodles.</span></span>
<span class="line"><span>- For the non-vegetarian (Adult): Baked Salmon Fillet with Steamed Vegetables and Lemon Butter Sauce</span></span>
<span class="line"><span>  1. Season salmon fillet with salt, pepper, and lemon juice. Bake in a preheated oven at 400°F (200°C) for about 15 minutes.</span></span>
<span class="line"><span>  2. Steam assorted vegetables (broccoli, carrots, green beans) until tender.</span></span>
<span class="line"><span>  3. For the lemon butter sauce, melt butter in a pan and add the juice of half a lemon. Simmer for a few minutes.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Dessert:</span></span>
<span class="line"><span>- Vanilla Ice Cream with Fresh Berries</span></span></code></pre>
<p>Hmm, delicious. Thank you Delia!</p>
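<p>The REPL isn’t the only way to talk to Delia, by the way. Ollama also serves a local HTTP API (on port 11434 by default), so the same custom model can be called from code. A minimal sketch using only the standard library - the helper names are my own, and it assumes the Ollama server is running locally:</p>

```python
import json
import urllib.request

def generate_payload(model, prompt, system=None):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if system is not None:
        body["system"] = system  # optional runtime override of the SYSTEM prompt
    return body

def ask(model, prompt, host="http://localhost:11434"):
    """Send a prompt to a locally running Ollama server and return the reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(generate_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the Ollama server running, something like:
#   print(ask("delia", "Suggest a quick vegetarian dinner."))
```

<p>Because the <code>SYSTEM</code> prompt is baked into the <code>delia</code> model itself, the calling code stays clean - it just names the model and sends the user’s question.</p>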
<h2 id="what-will-your-custom-chatbot-be">What will your custom chatbot be?</h2>
<p>As you’ve seen, tweaking the <code>SYSTEM</code> prompt alone is a simple thing to do, yet a surprisingly powerful tool that allows us to breathe life and persona into an LLM, and to pre-seed it with useful context that will allow it to better serve its intended purpose.</p>
<p>And because the Modelfile is just a text layer that sits above the base model, we can create as many of these as our imagination can come up with. From novelty personalities like Snoop Dogg, or a sinister HAL 9000 style assistant, through to a personal business coach or even an AI therapist.</p>
<p>Now, it’s over to you. What and who will you create? Whether it’s for fun, or research or something truly revolutionary, <a href="https://twitter.com/aaronrussell">let me know</a> what you come up with.</p>]]></description><link>https://2point0.ai/posts/create-unique-ai-personalities-with-ollama</link><guid isPermaLink="false">40b2ef71-a48c-531e-a87b-e866206fb2b2</guid><pubDate>Mon, 05 Feb 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[How to run large language models at home with Ollama]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/ollama-banner.BBLvNJDH_12zGbC.webp" srcset="https://2point0.ai/_astro/ollama-banner.BBLvNJDH_12zGbC.webp 1x,https://2point0.ai/_astro/ollama-banner.BBLvNJDH_12zGbC.webp 2x" alt="Run LLMs at home with Ollama" /><figcaption>Run LLMs at home with Ollama</figcaption></figure><p>If you’d asked the average person on the street 14 months ago what a large language model (LLM) was, you’d probably have been met with a slightly puzzled expression. But at the end of 2022, ChatGPT’s arrival smashed LLMs into the public consciousness. Now, you’re hard-pressed to find someone who hasn’t added “AI expert” to their LinkedIn bio, let alone someone who hasn’t used or experienced an LLM.</p>
<p>What many people don’t realise is that running an LLM locally on your own hardware is not only possible, but easier than you might imagine and capable of surprisingly impressive results. There are many reasons you might want to do this, be it cost or privacy, or just for the sheer fun of getting your computer to do impressively cool things.</p>
<p>Diving into the local LLM scene can be a little overwhelming. There are several backends for running LLMs and different frontends, and then you need to get familiar with Hugging Face (the GitHub of LLMs) and figure out the myriad of models, different formats and quantisation levels and, well, you get the picture.</p>
<p>Thankfully, there is a wonderful tool that simplifies this tangled journey into a pleasantly smooth ride. Ollama is a CLI tool that runs on Windows, Mac or Linux, and makes downloading and running LLMs on your machine decidedly uncomplicated.</p>
<h2 id="installing-ollama">Installing Ollama</h2>
<p>Running an LLM does not necessarily require a high-end PC with state-of-the-art GPUs, but older and slower machines with less than 8GB RAM are definitely going to struggle. In general, 8GB will allow you to run the smallest models, and from there the more RAM, CPU and GPU power you have at your disposal, the bigger and more impressive the models you’ll be able to run.</p>
<p>For now, let’s go ahead and get Ollama installed.</p>
<p>If you’re on Mac, installation couldn’t be easier. Simply <a href="https://ollama.ai/download">download Ollama</a> and install it in your applications. On Linux and Windows (via WSL2), you can install it with this one-liner:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="sh"><code><span class="line"><span style="color:#88C0D0">curl</span><span style="color:#A3BE8C"> https://ollama.ai/install.sh</span><span style="color:#81A1C1"> |</span><span style="color:#88C0D0"> sh</span></span></code></pre>
<p>If running bash scripts downloaded from the Internet gives you the heebie-jeebies, and you’re not afraid to get your hands dirty, you can of course <a href="https://github.com/jmorganca/ollama/blob/main/docs/linux.md">install Ollama manually</a>.</p>
<h2 id="finding-models-to-download">Finding models to download</h2>
<p>Ollama is not just software for running LLMs locally. It is also an online repository of LLMs that you can “pull” down onto your machine.</p>
<p>When you <a href="https://ollama.ai/library">browse Ollama’s repository</a>, you’ll discover all the different models they’ve listed. There are quite a few to choose from. At the top of the list you’ll probably see <code>llama2</code> which is Meta’s open source LLM, and the one that really paved the way for a lot of the innovation in the local LLM scene. If you click on it you’ll see an overview page offering a description of the model, and a tab for “tags” which lists many different variants of the model.</p>
<p>I plan to do a deeper dive into the naming conventions of the model tags and what it all means in a future post. For now, there are a couple of key patterns to bear in mind.</p>
<ul>
<li><strong>Number of parameters</strong> - the 7b, 13b and 70b you see in names refer to the number of parameters in the model, also known as the weights. These are the result of the training process and are indicative of the model’s size and complexity. Generally, bigger means better, but also more demanding on system resources. You’ll probably be able to run a 7b model without much hassle, but to run a 70b model or bigger you’re going to need a high-end computer with at least 64GB of RAM and an expensive GPU.</li>
<li><strong>Quantisation level</strong> - suffixes like Q3, Q4 and Q5 combined with K_S, K_M and K_L refer to the quantisation level. Quantisation is the process of converting the weights from higher precision values to lower precision - a kind of compression that allows larger models to be used on lower specced machines, with a slight impact on the overall quality of the model. The lower the number, the higher the compression and more noticeable impact on the model.</li>
</ul>
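<p>Putting those two ideas together, you can estimate whether a given model will fit in your machine’s memory: roughly parameter count × bits per weight, plus some headroom for context and runtime overhead. A hedged sketch - the 20% overhead factor is my own guess, so treat the numbers as ballpark only:</p>

```python
def approx_memory_gb(params_billion, quant_bits, overhead=1.2):
    """Very rough memory needed to load a quantised model: parameters
    times bits per weight, plus ~20% for context and runtime overhead.
    The overhead factor is a guess - treat the result as ballpark only."""
    return params_billion * quant_bits / 8 * overhead

for params, bits in [(7, 4), (13, 5), (70, 4)]:
    print(f"{params}b at Q{bits}: ~{approx_memory_gb(params, bits):.1f} GB")
```

<p>By this reckoning a 7b model at Q4 wants around 4GB, while a 70b model at Q4 wants around 42GB - which is why the big models demand serious hardware.</p>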
<p>At home, I run a Mac Studio with M1 Max processor and 32GB RAM. I find that I can just about run some 33b models with Q3_K_M, but performance is slow, and I question if the quality is noticeably better than a 13b model with Q6. Finding the right balance between size and performance for your hardware is something I guarantee you’ll chew up your monthly bandwidth quota doing once you’re up and running with Ollama.</p>
<h2 id="download-the-dolphin-phi-model">Download the Dolphin Phi model</h2>
<p>Let’s start with a model most desktops and laptops should be able to handle. The <a href="https://ollama.ai/library/dolphin-phi">Dolphin Phi</a> model is a fine-tune by Eric Hartford of Microsoft’s 2.7b-parameter Phi-2 model. I found some frustrating quirks running the base Phi-2 model that ruled it out for this post, but Dolphin Phi seems to work nicely out of the box.</p>
<p>Using Ollama, pull down Dolphin Phi with this command:</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="sh"><code><span class="line"><span style="color:#88C0D0">ollama</span><span style="color:#A3BE8C"> pull</span><span style="color:#A3BE8C"> dolphin-phi</span></span></code></pre>
<p>This will pull down the <code>2.7b-v2.6-q4_0</code> tag, a 1.6GB file, giving you just enough time to top up your cuppa.</p>
<h2 id="chat-with-dolphin-phi">Chat with Dolphin Phi</h2>
<p>Once the download completes, you can now use the Ollama REPL to run the model and chat with it in your terminal.</p>
<pre class="astro-code nord" style="background-color:#2e3440ff;color:#d8dee9ff; overflow-x: auto; white-space: pre-wrap; word-wrap: break-word;" tabindex="0" data-language="plaintext"><code><span class="line"><span>ollama run dolphin-phi</span></span>
<span class="line"><span></span></span>
<span class="line"><span>>>> Who are you and what is your purpose?</span></span>
<span class="line"><span></span></span>
<span class="line"><span>I am an artificial intelligence designed to assist humans in various tasks by providing information, answering questions, and offering support. My purpose is to enhance the lives of people by providing quick, accurate, and reliable information, while also being accessible and friendly. My main goal is to learn from human interactions so that I can better understand your needs and offer more personalized assistance.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>>>> What is the difference between a flat white and a cappucino?</span></span>
<span class="line"><span></span></span>
<span class="line"><span>A flat white is an espresso-based coffee beverage made by pouring hot water over finely ground coffee beans. It has a rich, creamy texture, similar to that of a latte or cappuccino. The main difference between a flat white and a cappuccino lies in the preparation method and proportions of the ingredients.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>A cappuccino is also an espresso-based drink but consists of equal parts espresso, steamed milk, and frothed milk. It has a distinct texture and taste due to the presence of more foam compared to a flat white. Cappuccinos are often enjoyed as a breakfast or brunch option.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>In summary, a cappuccino is made with an equal balance of espresso, steamed milk, and frothed milk, while a flat white has a greater proportion of coffee and less froth.</span></span></code></pre>
<p>Dolphin has understood my questions perfectly well and responded with text that seemingly answers the questions. The answer to the coffee question feels a bit wishy-washy, but I don’t think it’s totally wrong.</p>
<p>You’ll notice with the 2.7b models that it doesn’t take much to get it to veer into <em>making stuff up</em> territory. And personally, I think it’s good to see these smaller models so obviously talking out of their artificial back sides (or “hallucinating” as it is known). It’s funny, and also a good reminder that all of these models are just <em>language</em> models. Making stuff up is kinda what they do - the bigger models just do a more convincing job of <em>sounding like</em> they know what they’re talking about.</p>
<p>When you’re done playing with Dolphin Phi, you should try some larger models and see what you can squeeze out of your computer. I’d recommend trying the <a href="https://ollama.ai/library/nous-hermes2">Nous Hermes 2</a> models, which come in 10.7b and 34b variants. Both should show a massive improvement over Dolphin Phi.</p>
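<p>The REPL isn’t the only way in, by the way. Ollama also serves a local HTTP API (on port 11434 by default) - a topic for a later post, but here’s a small taster using nothing but Python’s standard library. This assumes the Ollama server is running and you’ve pulled <code>dolphin-phi</code>:</p>

```python
import json
from urllib import request

def generate_payload(model: str, prompt: str) -> bytes:
    # JSON body for Ollama's /api/generate endpoint; stream=False asks
    # for one complete response rather than a stream of token chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

# With a local Ollama server running, a one-shot prompt looks like:
# req = request.Request("http://localhost:11434/api/generate",
#                       data=generate_payload("dolphin-phi", "Why is the sky blue?"),
#                       headers={"Content-Type": "application/json"})
# print(json.loads(request.urlopen(req).read())["response"])
```

<p>The network call is commented out here since it only works with the server up, but the payload shape is all you need to start scripting against your local models.</p>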
<h2 id="whats-next">What’s next?</h2>
<p>Being able to run LLMs on our own machine provides a unique way to explore the inner workings of language models. And as we’ve seen, it doesn’t require an advanced degree or expensive hardware; just a bit of time, some patience, and a healthy dose of curiosity.</p>
<p>Ollama is a great tool for this, and we’ve only really scratched the surface. In future posts I’ll show how to import models from Hugging Face into Ollama and customise them with our own system prompts to create specialised chat assistants and role-play characters; I’ll take a look at multi-modal models that are capable of understanding images; and I’ll be digging into the Ollama API and learning how we can build apps and services on top of LLMs.</p>
<p>Stay tuned for all of that and more. Don’t forget to subscribe to the <a href="https://2point0.ai/feed.xml">RSS Feed</a> and share your own experiences and feedback with me <a href="https://twitter.com/aaronrussell">over on X</a>.</p>]]></description><link>https://2point0.ai/posts/running-llms-with-ollama</link><guid isPermaLink="false">c35e8afa-6bde-5fff-bf0c-6fe76343c861</guid><pubDate>Mon, 29 Jan 2024 00:00:00 GMT</pubDate></item><item><title><![CDATA[About 2.0: Decoding AI one neuron at a time]]></title><description><![CDATA[<figure><img src="https://2point0.ai/_astro/seeing-future.3uK25lKJ_Z3Simr.webp" srcset="https://2point0.ai/_astro/seeing-future.3uK25lKJ_Z3Simr.webp 1x,https://2point0.ai/_astro/seeing-future.3uK25lKJ_Z3Simr.webp 2x" alt="Looking into the future" /><figcaption>Looking into the future</figcaption></figure><p><strong>2point0.ai</strong> is a blog for the builders, experimenters and night-owls who stay up until the early hours nerding-out in code. If that sounds like you, we’re alike. I often immerse myself in the latest, shiny new tech - exploring, experimenting and crafting for weeks or months. But it’s not often that something comes along that leaves me in no doubt that we’re about to see the world we know change in front of our eyes.</p>
<p>I definitely felt that way in the mid-90s when I first used the World Wide Web. And it’s hard not to feel the same way today, when experiencing AI and large language models.</p>
<p>AI is one of those paradigm shifts. I view it as Computing 2.0, hence the blog name. What computers do for us and the way we interact with them is changing, and conventional apps and interfaces will soon seem outdated. Undeniably disruptive, there aren’t many professions or areas of society that won’t be touched by AI’s impact. We’re at the starting line of a race moving at breakneck speed, and where we’ll be in one year, let alone five or ten, is impossible to predict.</p>
<p>I’m writing this blog primarily to help <em>me</em> keep up. I learn by doing, and now, I plan to ‘do’ AI. This blog is my journal, my catalogue of adventures with AI apps, tools, APIs, large language models, code creation, and much more.</p>
<p>Anticipate at least one post a week from me, featuring tutorials, detailed guides to building AI-centric apps and services, insights into AI technologies, definitions of complex AI terms, reviews, updates, and my ramblings on all things AI.</p>
<h2 id="about-the-author">About the author</h2>
<p>I’m Aaron, and I’ve been getting computers to do fun things ever since figuring out how to misuse recursive loops in BASIC to repeatedly print “polite” messages for my siblings on the family ZX Spectrum.</p>
<p>I’ve worked in software since the early 2000s, and for the past 12 years I’ve been running a small <a href="https://www.pushcode.com">dev consultancy</a> helping businesses and startups bring their ideas to life on the Web. Most recently, I was part of the core team that designed and built the <a href="https://aldea.computer">Aldea Computer</a> blockchain.</p>
<p>When it comes to AI, honestly, I’m figuring this stuff out as much as anyone else. And that’s what this blog is all about: discovering AI, understanding how it all works, having a lot of fun and documenting it as I go along.</p>
<h2 id="stay-connected">Stay connected</h2>
<p>To keep up with my AI discoveries and experimentations, make sure to subscribe to the blog’s <a href="https://2point0.ai/feed.xml">RSS feed</a>. If you have any AI questions or topic suggestions, <a href="https://twitter.com/aaronrussell">follow me on X</a> and fire me a DM.</p>
<p>Thank you for reading and being here at the beginning of this journey. Stay tuned for more!</p>]]></description><link>https://2point0.ai/posts/decoding-ai-one-neuron-at-a-time</link><guid isPermaLink="false">ead93c9b-f7a5-5b3c-a6f3-53efecbaaf88</guid><pubDate>Wed, 24 Jan 2024 00:00:00 GMT</pubDate></item></channel></rss>