Are you training your models?
A weird obsession I don't fully understand
When I tell people that I own some 3090s (powerful last-gen NVIDIA GPUs) for running LLMs, they almost always ask the same question: are you training your own models? (Or whether I'm fine-tuning them on some data.)
Some people believe that explaining things through analogy or in simple terms signifies deep understanding. Maybe that’s why regular people understand AI-based products as things where you “use your own data” to train the AI so it knows your use case. Or they may be afraid of sharing data online because “it will be used to train the AI”. See? Training. It feels good, it’s like going to the gym: you have something weak, you train it, and it becomes better.
The analogy with learning works the same way: oh, we have a model which is learning, and we taught it X, Y, Z…
But you know what’s actually funny? Some models perform worse in production and on hidden test benchmarks after fine-tuning than before. Tricky, right?
I asked a friend who knows these things why people behave that way, and he gave me a quite simple answer: in their minds, you either mindlessly use ChatGPT as a consumer or you train your own models; there is nothing in between.
This way of thinking leaks into other projects, like the one building “our own national LLM”. They don’t mind that their model is worse than Llama 3 (the small 8B version) and not even competitive on the leaderboards. It’s our own, and that counts!
But whatever, you shouldn’t criticize other people, as it leads nowhere; it’s better to see opportunities and just move on.
So what can you do between consuming LLM outputs and making your own model from scratch?
I’ve kinda described it in another post, but I will make a second, better version here:
Obviously, start simple with commercial proprietary models: figure out whether you prefer GPT-4o or Claude 3.5 Sonnet.
Use the models a lot; if you aren’t getting rate-limited, it means you are not learning. Remember, you don’t pay a per-token price, you have a subscription, so if you don’t use it, you lose it.
Probably by this time you will have learned not to treat LLMs as a knowledge database, and you will start cringing when you see people asking them factual questions or riddles. It’s like having Google Earth, looking only for your own home there, and complaining that it doesn’t have recent data. Duh! It’s a language model, not a database or Google. Expecting the model not to hallucinate means you don’t understand how it works: the model’s output is hallucination by definition, because a generative model produces the most probable completion of your prompt.
By this stage you will have some intuition about their strengths and weaknesses. If you are a technical person, you can move on to the next steps.
Building workflows with LLMs
you use LLMs to solve some problem
you notice that it’s a kind of ETL pipeline (everything is an ETL pipeline if you squint hard enough): for anything that has a defined input and output, and even better, explicit steps, you can make a map of the process and use LLMs to solve it
And this step is IMHO very interesting.
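To make this concrete, here is a minimal sketch of what treating LLM calls as ETL steps might look like. It assumes the official openai Python client and an API key in your environment; the model name, prompts, and the llm_step/pipeline helpers are placeholders of my own, not anything prescribed.

# A minimal sketch of "LLM call as ETL step", assuming the openai package
# (v1 client) is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def llm_step(instruction: str, text: str) -> str:
    """One transform: a well-defined input text in, one completion out."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def pipeline(raw: str) -> str:
    # Extract -> Transform -> Load, each stage a plain function call
    cleaned = llm_step("Strip boilerplate; keep only the job requirements.", raw)
    summary = llm_step("Summarize the requirements as a bullet list.", cleaned)
    return summary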
This way is great if you are making a chatbot. You can include some template before or after your text to steer the LLM the way you want, but the core experience loop is just asking it questions, thinking about the answers, and working iteratively until you get “something”, then integrating it manually with an existing process.
In the next post I will explain patterns to break out of this cycle. Let’s start with an example: you have some data as input and you want structured data as output. Technically, you can achieve this by asking the LLM to give you the output as JSON and then parsing it, but this is error-prone, and the LLM won’t always oblige by giving you exactly what you ask for, as it may insert some fluff or comments. Below is a screenshot from the LLM library BAML, which uses nice hacks to get around this problem.
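Since a screenshot doesn’t reproduce well here, below is my own minimal illustration of the naive approach and one common hack: grab the first brace-delimited block from the reply before parsing. This is just a sketch of the problem, not how BAML actually implements its solution.

import json
import re

def parse_llm_json(reply: str) -> dict:
    try:
        return json.loads(reply)  # works only if the model obliged exactly
    except json.JSONDecodeError:
        # Models often wrap the JSON in fluff or commentary; fall back to
        # the first {...} block we can find in the reply.
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in the model reply")
        return json.loads(match.group(0))

reply = 'Sure! Here is your data: {"skills": ["C++", "Python"]} Anything else?'
print(parse_llm_json(reply))  # {'skills': ['C++', 'Python']}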
Oh, and I forgot about the motivation for this problem. Say there are a few job offers on a website and you have skills that would match them all (or some), but you don’t want to manually adjust your CV for each application. Say you are proficient in both C++ and Python: for the C++ job you want to highlight your C++ experience, and vice versa.
So you can paste in both your existing “full CV” and the job offer text manually scraped from the website, and ask ChatGPT to give you a tailored version. I usually operate on LaTeX typesetting, which, by the way, seems to increase the coherency and structure of the output.
See example below:
This is a screenshot, for educational purposes, from the justjoin.it website
And our CV
So the pirate wants to be hired. The LLM gives the output below, which is a nice starting point. It’s still all text, as you can see, so you can turn it into a pipeline:
Text output → LaTeX format → PDF
Looking nicer. The architecture for such a thing is quite simple:
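For illustration, here is a rough sketch of that pipeline in Python. It assumes pdflatex is installed and reuses the hypothetical llm_step() helper from the earlier sketch; the file names and the prompt are made up.

import subprocess
from pathlib import Path

def tailor_cv(full_cv_tex: str, job_offer: str, out_dir: str = "build") -> Path:
    prompt = (
        "Here is my CV in LaTeX and a job offer. Rewrite the CV to "
        "highlight the matching skills. Reply with valid LaTeX only.\n\n"
        f"CV:\n{full_cv_tex}\n\nJob offer:\n{job_offer}"
    )
    tex = llm_step("You are a careful LaTeX editor.", prompt)

    Path(out_dir).mkdir(exist_ok=True)
    tex_file = Path(out_dir) / "cv.tex"
    tex_file.write_text(tex)
    # Compile the generated LaTeX; pdflatex writes cv.pdf into out_dir
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode",
         "-output-directory", out_dir, str(tex_file)],
        check=True,
    )
    return Path(out_dir) / "cv.pdf"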
But there are obvious ethical and technical concerns: it may sometimes just hallucinate an “ideal” CV which doesn’t have much overlap with the original one. We don’t want to send out fake CVs, as this is a shortsighted approach.
The better approach would be to extract the skills from both documents (entity recognition would work for that), then calculate a match score or something similar.
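Here is a minimal sketch of that idea, reusing the hypothetical llm_step() and parse_llm_json() helpers from the earlier sketches; in practice you might swap the LLM extraction for a dedicated NER model.

def extract_skills(text: str) -> set[str]:
    # Placeholder prompt; the LLM extraction could be replaced with NER
    reply = llm_step(
        'Extract the technical skills as JSON: {"skills": ["..."]}',
        text,
    )
    return {s.lower() for s in parse_llm_json(reply)["skills"]}

def match_score(cv: str, offer: str) -> float:
    cv_skills, offer_skills = extract_skills(cv), extract_skills(offer)
    if not offer_skills:
        return 0.0
    # Fraction of the required skills that actually appear in the CV
    return len(cv_skills & offer_skills) / len(offer_skills)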
And there is another approach, which is hard and time-consuming but IMHO would produce the best results: figure out the skills needed for a given job and, as a programmer, just learn them. Many people claim that they are fast learners; you can “show, don’t tell” by making a few non-trivial projects for each skill that is on the “ideal” CV but missing from yours. That’s how you should look at this: as a roadmap of the skills you need to excel at some specific job.
And the answer is yes, I train my own models, from scratch, without any base models. For bonus points, I may even consider re-creating the famous zero-to-LLM video series from Karpathy’s website: https://karpathy.ai/zero-to-hero.html
It would be shorter and faster than explaining why I don’t train models so that I can have time to do something useful with them. (Below: the train and validation loss from my from-scratch training.)
[Figure: train and validation loss curves from the from-scratch training run]