Locklin on science

Coding assistant experience

Posted in tools by Scott Locklin on February 18, 2026

I’m a modest LLM skeptic. It’s not that I don’t believe in LLMs, I am aware that they exist, I just know that they’re not doing what people do when we think, and that they’re not going to hockey stick up and replace everybody. If it helps people, they should use them: I do. ask.brave.com is my first stop for answering transient questions or software configuration issues. It produces useful results and cites its sources; a great search API. It also doesn’t remember what I asked it (Brave is privacy first), which is what you want most of the time. Grok gives OK answers too, but I don’t like the answers as much, and I have no idea what their privacy policies are. Qwen has been OK for answering coding questions and small code fragments.

I have a few jobs I’ve been putting off; fiddly and annoying translations from Python to R, updating APIs, etc. I also have a couple of challenge problems I ask AI chatbots in order to gauge where we’re at on things I care about. Qwen is by far the best free and open chatbot I’ve used, and it had gotten good enough I decided to fork out for claude-code and take it for a spin. I was also inspired by asciilifeform’s comments; dude’s grouchier and more skeptical than I am, so I took his statements on the utility of claude-code very seriously. People who already use LLMs at work can probably skip to the end, as you know more than I do about using these things, though maybe some of the observations are of use.

Mostly the type of work I do is numeric, and numeric coding is significantly different from what most programmers do. I never had any doubts that an LLM could do Javascript plumbing, or even back end plumbing code. There are lots of examples of this to train on, along with complicated regular expressions, SQL queries and so on. I figured they’d eventually do something with numeric stuff, though it was less clear when it would happen for my favorite programming languages.

Some claude-code notes:

0) You need to pay for the $200/month plan to get anything useful done with claude-code. This is annoying: on that plan it’s difficult to burn all your tokens, but the cheap plans run out almost immediately. Jerks. I should be able to pay as I go without talking to some salesdork or signing up for a subscription.

1) Claude code has access to your hard drive, and you have to invoke lucifer and kernel modules to keep it from ruining your life. Yah, in principle you can trust the thing. Back in the 90s you could in principle have an RPC daemon on your Sun workstation which executes arbitrary code, and most of the time nothing bad would happen. Anyone who trusts this thing with sensitive code is fucking retarded. You need to run local for this.

2) One of my unpleasant tasks is translating work by the lost souls who think Python is an adequate mode of scientific communication into something less insane (in my case, R, though I still hold Matlab is the best tool for scientific communication). That’s something an LLM should be great at. Mostly the chatbots haven’t been, but recently they seem to have acquired the skill. This was my most pressing reason for trying claude-code, which I assumed would be better than a chatbot.

Claude managed to achieve the task in maybe twice the time it would have taken me, in a fashion quite a bit more code complete than I would have done. Of course it forgot to add a predict method for a bunch of algorithms people basically only use to predict things, but once I told it to do so, it did. The first go-round it reproduced every Python class in the old repo and made them all public, which is exactly what you’d expect from a machine that doesn’t understand anything: the actual workflow is “fit model, predict model,” so you need exactly two public functions, with the other functions called as options inside the create function. Once I yelled at it enough, hollered at it to update the manual pages to match what’s inside the functions and so on, it did a reasonable job.

Another thing I find extremely painful in R: making a vignette and festooning the source with inline documentation using rmarkdown. I’ve always found this onerous, but the LLMs don’t seem to mind. I prompted it to use a Google style guide for R packages, so the style isn’t horrible. Beating it into shape was a fairly high attention process, though it was my first time using claude-code. All told I put much more time into it than I would have fooling around on my own. This is because it’s low effort work, where writing it yourself is high effort work. There’s a problem here: since it’s low effort to generate a lot of code, now you have a lot of code. Code that has to get maintained if you’re actually using it.

3) Another major unpleasant task I have is turning a paper I read into code. For simple things, LLMs should be able to do this. For more complicated things, I assume there is a limit based on context windows. Indeed claude-code was able to turn this paper (my go-to challenge problem) into reasonable working R code: Bernoulli Naive Bayes with EM semisupervised updates. This is something I had done myself for a project, but never checked into any remote repo, so I knew there would be no cheating. I also looked fairly extensively for an example on github and didn’t find any (albeit some years ago now, but people are retarded and would rather fiddle with neural nets than use this most excellent trick). Claude was considerably slower at this than the translation job, and produced what I consider fairly poor quality code, though I didn’t prompt it with any style guides. Still, actually doing the damn thing is pretty good, and I’ll be testing this type of “read the paper, geev mee code” job further with more difficult problems. For those of you not in the know, Bernoulli Naive Bayes is basically column means, and the EM algorithm is awfully simple: maybe around the complexity of Newton’s method. Someone like me can do it in an hour if you point a gun at me and give me an espresso enema, or a couple of hours if I’m taking my time and being careful. If I can get algorithms from papers on non-trivial problems, this is a nice application for me; I have an enormous backlog of interesting looking ideas with no public code associated with them. Understanding the papers in enough detail to write code is a pain in the ass, especially if you don’t have good building blocks.
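
Since “basically column means” may be opaque to the uninitiated, here is a rough numpy sketch of the trick. The function names, the Laplace smoothing, and the initialization are my own choices; this is textbook semisupervised EM, not claude’s output or the paper’s exact recipe. Note the public surface is exactly the fit/predict pair mentioned above.

```python
import numpy as np

def fit_bnb_em(X, y, n_iter=20, alpha=1.0):
    """Semisupervised Bernoulli Naive Bayes via EM.
    X: (n, d) binary matrix; y: length-n integer labels, -1 for unlabeled."""
    labeled = y >= 0
    classes = np.unique(y[labeled])
    k = len(classes)
    # responsibilities: hard 0/1 for labeled rows, uniform for unlabeled ones
    R = np.full((X.shape[0], k), 1.0 / k)
    R[labeled] = 0.0
    R[labeled, np.searchsorted(classes, y[labeled])] = 1.0
    for _ in range(n_iter):
        # M-step: class priors and per-feature Bernoulli means, i.e.
        # responsibility-weighted column means with Laplace smoothing
        pi = R.sum(axis=0) / R.sum()
        theta = (R.T @ X + alpha) / (R.sum(axis=0)[:, None] + 2 * alpha)
        # E-step: posterior responsibilities, updated only for unlabeled rows
        log_p = (X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # avoid underflow
        P = np.exp(log_p)
        P /= P.sum(axis=1, keepdims=True)
        R[~labeled] = P[~labeled]
    return {"classes": classes, "pi": pi, "theta": theta}

def predict_bnb(model, X):
    """Argmax of per-class log posterior; the whole 'predict method'."""
    log_p = (X @ np.log(model["theta"]).T
             + (1 - X) @ np.log(1 - model["theta"]).T
             + np.log(model["pi"]))
    return model["classes"][np.argmax(log_p, axis=1)]
```

That’s the entire algorithm: fit is smoothed column means weighted by responsibilities, predict is an argmax. An hour’s work for a human with an espresso enema, as advertised.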

4) The final category of unpleasant “I will likely defer this job forever” task is gluing an API into R (or J, which I have ambitions of getting back to), then using that to implement an algorithm. I asked claude to fill out some of the missing functionality from mlpack. The results looked OK; I didn’t test them. I also had it code up an mlpack API for J, which it appeared to do (it’s been so long since I used J that testing it was painful; sorry about all the sub dependencies it put in the repo).

Tasks 2 and 3 are my most common use cases. Mostly it doesn’t matter if the results are slop. Task 4 is an occasional dreary task as well, though R has a decent ecosystem of people who have done this for everyone. Teaching the thing how to do my daily tasks is probably also automatable to some extent, but it would mostly be a waste of time. Interactive work is interactive, and Captain Kirking it with an LLM agent is just going to piss me off. I don’t even like using R notebooks, so making an LLM R notebook is no good.

qwen3-coder-next:

I also ran qwen3-coder-next on my threadripper. It’s slow, but can be used if the threadripper isn’t chugging on any other serious tasks. The motivation isn’t to avoid the $200 a month subscription fees; it’s the fact that I don’t trust Claude with anything actually sensitive, like things which produce money for me. It was a pain in the ass to get this stood up and functioning. I did it like this:

numactl --interleave=all ./build/bin/llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_M \
--numa distribute \
--threads 32 \
-c 262144 \
--no-mmap \
--jinja \
--host 0.0.0.0 --port 8080
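
For what it’s worth, llama-server exposes an OpenAI-compatible HTTP API, which is how tools like aider and gptel talk to it. A minimal client sketch, assuming the server above is up on port 8080; the function names are mine, while the endpoint path and payload fields are the standard OpenAI chat-completions shape:

```python
import json
import urllib.request

def chat_payload(prompt, model="qwen3-coder-next", max_tokens=512, temperature=0.2):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,  # llama-server serves one model; field mostly ignored
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(prompt, host="localhost", port=8080):
    """POST a prompt to the local llama-server and return the reply text.
    Requires the server from the command above to actually be running."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any tool that can speak this protocol can be pointed at the threadripper instead of somebody’s datacenter.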

ollama basically doesn’t work. In this case, for the first round, I ended up using a python tool called aider to run it (claude-code-agent in emacs for the claude-code interactions). I think aider is a little clunky; it couldn’t figure out how to make a subdirectory from where I invoked it. Probably choking on context. Might be user error somehow; I went back to emacs (gptel-agent) later and fixed it.

TPS appeared to be on the order of 20, with very slow prompt processing. Claude is roughly twice this speed, though it feels faster because it’s running on someone else’s hardware and doesn’t choke as badly on context. I was able to reproduce the semisupervised Bernoulli Naive Bayes with EM updates example that claude-code did, as well as a simple Python translation example (a novel fast fitting method for logistic regression). It took about as long for the first round, but wasn’t as smooth an interaction. I fed it exactly the same prompt. It got the algorithm right in the first shot, but the NB R package was all borked up, which is the kind of thing I noticed in the qwen chatbot. This required a fairly long context window, so I’m a bit dubious about pointing qwen-code-agent at a more involved paper until I upgrade my hardware. I actually like the code qwen produces a little better. Not bad for 3 billion active parameters, thank you based Chinese frens. Oddly the Python translation seemed to give it more trouble, again I think because of the slowness of parsing context windows on the threadripper.

There are a couple of reasonably cheap potential hardware solutions to run this qwen3 thing without heating up the threadripper or spending 10k on a big video card and a new power supply; the Strix Halo from AMD and the NVIDIA GB10 Grace Blackwell. Both are small boxes running Linux with 128G of shared memory and a medium-beefy GPU. Neither seems to have any huge performance advantage over the threadripper or each other (real world experiences welcome; supposedly NVIDIA is faster on context), but they’d allow me to do vibe coding while using the threadripper cores for other tasks. Nice airgap as well. If anyone owns such a shoebox machine and has had good experiences, feel free to pipe up. I ordered the AMD gizmo so I wouldn’t have to deal with maintaining a development environment for ARM chips. I’ll probably run the claude stuff from this machine as well for the airgap benefits.

While qwen3 did an OK job, it was no fun to work with. The slow context parsing speed makes the tooling even more clunky, though with emacs (gptel-agent) it was a better experience than aider. The agentic part of the mechanism, and the differences in how something like claude-code (an NPM package) works, aren’t fully clear to me yet. “Thing that runs machine generated shell scripts” seems to be about the size of it. How the LLM knows when it’s hooked up to something with agency isn’t clear. I suppose I can ask an LLM for an explanation here.

random unconnected thoughts:

A fun and actually useful thing to try would be to get one of these things to make Lush 64 bit clean. If I could do that without bothering the authors, that would be amazing. Maybe I can burn up some Claude tokens on this when I’m not using it for other tasks.

The chatbot part: I don’t think Claude Opus 4.6 is anything special. Like all the other ones, it speaks authoritatively, talks in circles, contradicts itself and is generally full of shit. Makes a decent coding assistant though. Asking it for advice on buying a machine for running qwen3 locally, for example: actual search engines (including ask.brave.com) produce better results that don’t contradict each other every other line.

Fun thing I didn’t fully realize until performing this exercise: LLMs don’t have state. The tooling keeps state by feeding the prompt (in most cases the entire conversation, including the entire codebase you’re working on, all the search results, etc., every time there is an update) back to the LLM, along with the most recent results. This is, of course, insane. It is particularly insane that people think this kind of Rube Goldberg contraption is sentient somehow. LSTMs are more sentient.
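
A toy sketch of what this means, assuming nothing about any particular vendor’s plumbing: the “chat” is a pure function of an ever-growing transcript, re-serialized and re-sent on every single turn.

```python
def stateless_llm(full_prompt):
    """Stand-in for the model: a pure function of its input, no memory at all."""
    return f"<reply to {len(full_prompt)} chars of context>"

def chat_turn(transcript, user_msg, llm=stateless_llm):
    """One turn of a 'stateful' chat built on a stateless model.
    All state lives in the transcript, which is re-sent in full every turn."""
    transcript = transcript + [("user", user_msg)]
    prompt = "\n".join(f"{role}: {text}" for role, text in transcript)
    reply = llm(prompt)  # the model sees the entire history, every time
    return transcript + [("assistant", reply)]
```

Every turn costs time proportional to the whole history to re-process, which is why context parsing speed dominates on local hardware.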

Complexity: R packages implementing an algorithm are a decent sweet spot for something like this. The R packaging system is designed to insulate the REPL from shitty coders who understand things about statistics. The context window is never going to be enormous; it’s generally going to be a couple hundred to a thousand lines of code that accomplish a well defined numeric task.

Productivity thoughts:

One thing which is for certain: claude-code isn’t replacing anyone’s job. Anthropic’s headcount isn’t getting smaller. The good thing about using a tool like this is that it has low cognitive overhead; I have to figure out how to constrain a mildly retarded computard helper and make it do the things I actually care about. Once I’ve read the paper or glanced at the original source I have a fair idea of what I want the result to look like, and I have to break the task down to something a retard could understand. This is something I do for myself already (being retarded 👍), though the degree and quality of my personal retardation is considerably different. I also have to debug the result afterward: there will be a lot of bugs, where writing code interactively is a kind of online debugging. But it is useful enough, and does things I find onerous and unpleasant in a relatively painless manner, so I’m gonna use it. Sort of like an employee, yes: but a bad employee. One you can’t trust with anything important, and who takes longer to accomplish tasks than doing them yourself. People who trust vibed code with important things, well, rotsa ruck to you.

There’s a hidden cost to this sort of thing. Because you can write a bunch of code without burning up your precious brain-sugars, you will write a bunch of code. Now you have a bunch of code of dubious utility. In my case, I’ve been very careful to not engage in writing code from papers or translating from python or whatever unless I was pretty sure there was paydirt. Now I’m gonna do it more often. While it feels non-tiring to do this sort of thing, it still takes a nontrivial amount of time, and an even more nontrivial amount of time to evaluate the algorithms the LLM made for me. Maybe I should be working on something else?

For a trivial example, I just spent a couple of weeks fooling around with this nonsense. I have one machine generated R package of marginal utility to my actual project to show for my troubles, as well as a much better understanding of the abilities of LLM coding assistants. This is absolutely abysmal from a productivity point of view. Lines of code generated looks amazing, but I don’t get paid for lines of code. “Maybe it will pay off in future productivity,” but that sounds an awful lot like the sales bilge on the tin for vendors of these things. The real world results indicate otherwise. People are even starting to notice the Solow paradox again, aka the observation that ladies with a rolodex, telephone and filing cabinet were about as economically efficient as offices with everything online and in databases.

Consider my likely trajectory with this crap: I’ve already dumped $2200 into a Claude membership and a new piece of hardware to run qwen3-coder for me. I’ll have to configure and maintain that piece of hardware, burning more real world time, plus the ongoing cost of claude if I continue the membership. I’ll also burn real world time coding up random ideas I would have ignored in the past, or only approached cautiously. Just like putting the internet on my computard, it will open up vast new avenues for wasting time, rather than keeping me focused on actually economically productive goals. Is it a win or a loss? I can’t tell. Still gonna use it, but cautiously.

https://github.com/locklin/vibe-coding-experiments

33 Responses

  1. Darth Vader of the Internets said, on February 18, 2026 at 4:30 pm

    “Sort of like an employee, yes: but a bad employee” and “you will write a bunch of code.”

    Yes, an inexperienced employee with an oversized toolkit it knows how to use but not why, therefore not to be trusted. But indefatigable at doing slop work.

    So I have found it slows me down if I give it the job at too high a level, e.g. code up an algorithm idea or understand some complex flow. It just does not understand numerical speed and efficiency tradeoffs. But its ability to spit out code has accelerated my debugging of algorithms and optimizing them. I code to do something; I do not like coding itself much after 40+ years of it.

    Debugging has always been tiresome; coding tools like valgrind and gdb exist, but are really for debugging data structures or basic logic flubs. They are much less useful for debugging errors in overall application logic and flow, the “it runs, but wtf is causing this impossible output/side effect after 2 hours!” kind. Often this means the shameful resort to crappy test scripts and print statements. I found claude-code works great for that sort of thing; I haven’t used a debugger in weeks. For example, I can just tell it to add customized timing/memory/stats collection tracking code to every function along some execution trace, create test generators that produce weird conditions, and add custom visualizers that show certain conditions. Example: add debugging fields on every order in the order book, look for a specific error condition (e.g. builder shows a crossed market where none of the feeds indicate one and I have inside orders – a super breakpoint), then build me a custom visualization showing the problem order and the order book, rewind 1 second, and show me the feed packets leading up to the event. Thousands of lines of code that would have driven me bonkers to write, get working, and throw away. And when asked to analyze output, the dang thing even looked into binary data files with some 1960s command line utilities and suggested how to add checksums etc. unprompted. And once fixed, just rm the branch without a thought.

    Also good for overall code hygiene. On old code I told it to instrument and test various caching/ejection schemes and data structures, evaluate memory mapping vs block loading etc., then implement and verify the output matched 1-1. All stuff I had on the back burner to investigate but never could convince myself to write.

    FTR I use C++ for this, and I find the structured language makes Claude rock. Does much worse on Python code.

    Like all tools it takes a while to learn, more a matter of learning how to incorporate it better into your own workflow. But I think Anthropic nailed it building it around CLI and Unix pipes as opposed to using it in an IDE where it behaves like a manic version of Steve “one-button is enough” Jobs.

    Running it locally sucks – slow, and it burns up my TITANs when I have to work there; I wonder what these guys are running in the data center. But that security issue with going central is something I also have not yet gotten comfortable with. And tokens cost $$$.

    • Scott Locklin said, on February 18, 2026 at 11:03 pm

      I assume the strix halo shoebox is going to be no Titan, but it’s got to be faster than the threadripper on context parsing. And for some tasks, it really has to be done locally even if it’s slow and clunky. Can always look in on it during pomodoro breaks.

      Unlike OpenAI, Anthropic seemed determined to ship an actual product that at least purports to do specific things. They’re still selling pixie dust, but “have an actual product” is a pretty good start, and most of them don’t.

      • Frans Coetzee said, on March 31, 2026 at 8:02 pm

        Thought I’d add another use case here — my main server yesterday had a major hardware failure, including multiple disk failures and RAID corruption. So I had to retrieve data and diagnose multiple hardware failures. The box was essentially bricked. I ended up sitting in front of a plugged-in tty, taking screenshots on my cell phone; Claude read these and gave commands, all the way from the BIOS screen to getting some basic functionality going. Absolutely amazing, best tech support I ever saw – somewhat scary, e.g. it worked around PAM/SELinux blocks, figured out python was working but sshd was corrupted, and created a script with a server to allow me to ship in files and scripts. In around an hour I had retrieved almost all data at bit level, restitched it, diagnosed and re-configured the working parts of the server, and was able to start shipping bits off-server. It even looked online and suggested replacement hardware. Definitely will install Claude on all my recovery hardware and tools. Probably it should just be included in all Linux distros as part of recovery tools. Bad news — I foresee many, many fewer jobs at these datacenters than is projected 😦

  2. markfnordfoundation said, on February 18, 2026 at 5:26 pm

    I think this is a reasonable take. I don’t like LLMs at all. I personally don’t want a code assistant, and the risk of “hallucination” (and brain atrophy) is enough to keep me away from chatbots for research. The value prop of summarizing search engine results is also nil to me because most summaries are useless and LLM generated summaries can be worse than useless because the nature of these programs is to mutate what they’re summarizing, resulting in misleading summaries. I saw a previous commenter on a previous post say he uses an LLM system as a surrogate nanny, which just seems like a new frontier in neglecting your children (not to mention the weird surveillance aspect).

    I’ve taken what you said a while back about not outsourcing your own thinking to heart. But then again, I’m an asshole and I figure my time in the software industry is over, because even before LLMs, I hated all the constant bullshit I had to keep up with, the culture of the software industry itself, and I’ve come to realize I’m only a computer programmer in the first place because it was the path of least resistance from being on the computer too much as a kid. I’m now retraining to go into the trades. Should have realized I’m not a good fit for the industry a lot sooner. I’m happy that I’m leaving, but even on my personal code projects, I don’t foresee myself using chatbots. I like crafting my own code, I guess in the same way people still enjoy carving wood. So it goes.

    The amount of money you have to spend is absolutely insane. Talk about an increasing organic composition of capital. My prediction for the software industry is that “programmers” become quality assurance workers on a code assembly line. Software architects will produce the specs, pass it off to the machine, and offshore workers will inspect and tweak the outputs of the machine as needed. That’s just not the kind of job I wanna do, nor do I expect it will pay well (but it will probably pay magnificently for the third worlders who get to sit in air conditioned offices doing this kind of work rather than plowing fields on subsistence farms).

    • Scott Locklin said, on February 18, 2026 at 10:56 pm

      I wish you luck in the trades. One of my HS bros who became a refrigerator mechanic got in touch with me recently after about 34 years, he’s had a really good life doing that, and doesn’t have to put up with the soy baloney people do in the software industry.

    • sigterm said, on February 19, 2026 at 5:12 pm

      I saw a previous commenter on a previous post say he uses an LLM system as a surrogate nanny, which just seems like a new frontier in neglecting your children (not to mention the weird surveillance aspect).

      I think you’re writing about my comment. It was about an acquaintance, and the surveillance aspect was understated. The children are very clearly not doing fine; my point was this is our brave new world, and this time AI is here to stay.

      I’ve come to realize I’m only a computer programmer in the first place because it was the path of least resistance from being on the computer too much as a kid.

      Isn’t that much of our crowd in one sentence. I sometimes wonder if someone didn’t put a bit too much aluminium in our DPT shots so they could kickstart the IT revolution, and some of us are slowly recovering.

      I’m now retraining to go into the trades.

      Even if this might be the optimal solution for some, I’m uneasy with the exodus to trades, as wokeism seemed like exactly a purge of disloyal (including, moral) elements from the professional/managerial class, a position of high agency in western society. The enemy got what they wanted.

      An adventurous friend who walks and hitchhikes a bit everywhere told me one of the most functional families he found in his travels was a lumberjack’s, someone who had emigrated to the Finnish wilderness to escape covid insanity in western Europe. So perhaps this kind of move is for the best.

      For my part, I see an opportunity for LLM wranglers to start their own business if they can handle marketing, integration and support. They’re very useful at the start of a project. This is a much, much better social situation than, say, the chemical engineer, who, if he has a good idea, is much more likely to depend on a patron from the class above, who has the capital to make it happen. Maybe organic chemists can form small startups and produce 500g/year worth millions, but that’s the most independent situation I’ve heard of, outside the wild west of software.

  3. Sean Purser-Haskell said, on February 18, 2026 at 7:51 pm

    One thing for which I’ve found these systems to have become surprisingly useful over the last few months is debugging. I can ask why a particular test is failing, and the system will chug away for 10-15 minutes and give me a detailed half page description of what’s going on inside the program.

    Unlike a human who could do this analysis, it seems incapable of suggesting a decent general solution to the problem. Nonetheless, debugging itself can be a big sink of time and frustration. Given that it takes a while and doesn’t always work, though, I tend to reach for it after realizing I’ve been stuck for a few hours.

    • Scott Locklin said, on February 18, 2026 at 10:53 pm

      I dunno if the fixes are any good, but I’ve got claude working on updating Lush to be 64 bit clean. I’m pretty rusty on the codebase, and totally don’t understand the binutils bits in the glowing crystal at its heart, but it all looks plausible and the interpreter passes its normal tests so far.

      https://github.com/locklin/lush-claude

  4. ian said, on February 19, 2026 at 5:44 pm

    Regarding languages for scientific communication, what do you think of Mathematica? To clarify, MATLAB is my tool of choice for my work (heavy simulation and control systems work), but I often wonder if I’m missing out (I’ve only poked at Mathematica). I did try Python for a while and gave up and went back to MATLAB.

    • asciilifeform said, on February 19, 2026 at 6:32 pm

      At one time I used Mathematica extensively for commercial work, and even persuaded the organization I was consulting for at the time to purchase a site license. And in my experience, it turned out to be an attractive nuisance and dangerous glue trap if used for anything other than certain kinds of exploratory work where literally nothing else will do. First, try literally any and all alternatives you can possibly think of, before plunging into this pit of expensive, slow, and unreliable goldentoiletware.

      • Scott Locklin said, on February 20, 2026 at 11:05 am

        BTW thanks for pressing on the Claude thing a few months ago. It is indeed an interesting and powerful tool when you constrain it properly. Been having fun cleaning up and adding functionality to my favorite little lisp. Mixed results; it made some real slop last night, but pointing it at this tool is a good exercise. Like your video device driver: I certainly couldn’t have done this in comparable time.

        https://github.com/locklin/lush-claude/

        • asciilifeform said, on February 27, 2026 at 6:53 pm

          You’re welcome.

          FWIW, the experiment I mentioned earlier is now posted here: https://github.com/asciilifeform/P4-HMD

          The bot-generated driver is in “kmod”.

          • Scott Locklin said, on February 27, 2026 at 9:52 pm

            Thanks for this. Makes sense it was in C; it’s decent at limited C projects. I’ve been vibe coding for a good week and a half now, mostly trusting the thing to test itself for me. Found out lots of amusing stuff with the context window; it regularly forgets that you have to close parenthesis in Lisp, despite working on nothing but lisp for hours on end. More unpleasantly, it also regularly tries to install stuff as root without even asking. It can’t of course; I made sure of that.
            Going through all the mess it created in coming days to check that it does what it claimed.

    • Scott Locklin said, on February 19, 2026 at 8:54 pm

      If you do a lot of symbolic integration, Mathematica might be useful; otherwise it is as asciilifeform said. There are plenty of free computer algebra systems out there if you need to do an occasional integral or simplification of complex equations, and they’re all pretty good too. Matlab is the best tool for scientific communication, and is good for the kind of work you do.

  5. Arno Brander said, on February 20, 2026 at 6:08 pm

    I dunno what kind of computer vision/DL doodad OpenAI uses, but I’ve found it to be a useful tool that helps me get stuff done faster. As an example, I like to take a picture of handwritten formulas and feed them to it. More often than not it will come up with corresponding LaTeX code. But as you said, and I will concur, I’m not particularly enthusiastic about any of the privacy policies these companies have. It’s quite evident that they’re harvesting all of your data, so I’d be careful about giving them any of my personal info.

    • Scott Locklin said, on February 21, 2026 at 12:09 pm

      These kinds of things are great of course. WhisperAI is one of the most impressive tools from these companies, and was available in very early days to be run locally on your hardware. My early use case for it was to transcribe and later summarize the 100 podcasts a week people send me saying “you have to listen to this.” This kind of problem (transcribing audio to text) was sort of half solved for a very long time, but that was a very good solution.

  6. ahgamut said, on February 21, 2026 at 9:35 am

    I also finally caved and bought the $20 subscription this week. Not a fan of writing (HTML/JS/Dockerfile/CMakeLists/Makevars.win/RMarkdown), so I figured I’d have the bot do that. Unsure it’s particularly useful for numeric tasks; implementing old papers is nice, but I prefer to figure out what the damn paper is saying so I can at least suspect errors in the lines it spits out.

    I noticed when opening claude in the browser that something appears to be running when I’m on other tabs (even before I enter a prompt), so I had claude-code run inside a docker container for the initial attempt. With enough cash I’d set up a local model. I wonder what kinds of data they send back.

    The user interface for programming is bad, amplifying the worst aspects of punch-cards: long waiting and incomprehensible errors. I don’t like debugging bot-generated stuff, am slightly horrified you’re using it with R.

    Having the bot port Lush to 64-bit is an absolutely brilliant idea. I’ve bookmarked the repo, let me know if it works end-to-end.

    • Scott Locklin's avatar Scott Locklin said, on February 25, 2026 at 1:16 am

      It works end to end, though Leon points out I didn't really instrument it for 64-bit. It looked like a logical approach though (basically changing all the 32-bit pointers and some fiddly bits).

      The $20 thing won’t last long for coding.

      I don’t worry much about claude phoning home from the browser (you should be running Brave), but obviously it sends a lot back when you’re using the CLI.

  7. togbe's avatar togbe said, on February 23, 2026 at 9:37 am

    Thanks for this; we've been waiting on further commentary from you, as you were one of the few people generally railing against the idea that LLMs make things more productive, and this provides a good update for setting our priors.

    The paper-algorithm-into-code task is the kind of tedious but useful thing the clankers are good enough at to be dangerous, for now. But we can expect them to get better. Running them in adversarial loops (i.e. critique the code of this algorithm against this paper spec) against each other burns a lot of tokens but gets better results.
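    The adversarial generate/critique loop described above can be sketched in a few lines. This is a hypothetical control-flow sketch, not any particular vendor's API: `call_model` is a stub standing in for a real chat-completion call, so the loop runs without network access.

```python
# Hypothetical sketch of an adversarial generate/critique loop.
# call_model() stands in for any chat-completion API; here it is a stub
# so the control flow is runnable without network access.
def call_model(prompt: str) -> str:
    # Stub: a real implementation would call an LLM endpoint here.
    if "critique" in prompt.lower():
        return "LGTM" if "[fixed]" in prompt else "off-by-one in loop bound"
    return prompt + " [fixed]"

def adversarial_loop(paper_spec: str, max_rounds: int = 5) -> str:
    # One model drafts an implementation; a second critiques it against
    # the paper spec; the draft is revised until the critic is satisfied.
    code = call_model(f"Implement the algorithm in this spec:\n{paper_spec}")
    for _ in range(max_rounds):
        review = call_model(f"Critique this code against the spec:\n{paper_spec}\n{code}")
        if review.strip() == "LGTM":
            break  # critic satisfied; stop burning tokens
        code = call_model(f"Revise the code to address: {review}\n{code}")
    return code
```

    In practice the generator and critic would be separate contexts (or separate models), and `max_rounds` is what caps the token burn.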

    Also, your comment on not having state is spot on. You can get around it for now by having them write out context into a notetaker or as part of the project, but it's one of the funny quirks of the architecture and makes them act like the guy from Memento.

    • Scott Locklin's avatar Scott Locklin said, on February 23, 2026 at 11:54 pm

      Claude actually saves its full context window in JSON files I've never actually looked at. I prefer it to keep human-readable notes. It doesn't always keep important info like "don't forget to close parens when writing lisp."

      Anyway, it would be wrong to say there has been no progress with LLMs. Progress and economically useful progress are different things. Economically useful progress that meaningfully increases productivity is another thing entirely. Economically useful progress that increases productivity and results in a profitable company is yet another. I'm still not sure it's all that great; it can do things with seemingly miraculous rapidity, then bog down into absurd loops on crap I could do a lot faster. Definitely no brain in the can: if you don't tell it what the preferred architecture is, it will be stupid.

      I also asked it (the chatbot) to invent a machine learning algorithm I had already invented and never told anyone about (nothing amazing, kind of obvious, but definitely new), and the herp-derp response was pretty disappointing. I got the idea from reading the literature in a certain order, and I guess from knowing things and being me. An LLM can't actually have ideas, though, even if it seems to sometimes.

  8. Charnel Mouse's avatar Charnel Mouse said, on February 25, 2026 at 9:53 pm

    Did you need to do much steering of Qwen away from stacking on dependencies? I suppose that’s less of a danger if you’re not doing something built on manipulating data frames.

    • Scott Locklin's avatar Scott Locklin said, on February 25, 2026 at 11:18 pm

      No, I started from a clean context and only did the limited projects linked above. I'll do more with it when I get my Strix Halo box next week.

      Definitely have found numerous limitations with Claude at this point, almost all relating to context window.

  9. Daniel Walley's avatar Daniel Walley said, on February 27, 2026 at 2:59 am

    Personally I find myself only willing to use these tools for constrained problems which are beyond my experience.

    E.g. write a function to identify the point in time when two moving circles will collide (game dev stuff).

    After which I'll take the time to read, comprehend, and revise the output to my preferences by hand.

    With that approach I don’t have to fear it putting its fingers in the rest of the codebase, and I don’t have to fear it turning the codebase into something I no longer grok.

    Which is my main resistance to over-using these things – if there’s any risk of the codebase becoming something I don’t understand, I just intuit it causing pain down the road.

    Maybe I’m just stubborn about clinging to old fashioned ways, but I’m not sure – the idea of a fully agent generated codebase with no humans around having an adequate mental model of it is terrifying to me, I just don’t see how it won’t inevitability slow things to a crawl after some point of cognitive and technical entropy is reached.

    • Daniel Walley's avatar Daniel Walley said, on February 27, 2026 at 3:06 am

      (Now I guess this is exactly what you’d do with a bad employee as well, so go figure).

    • Scott Locklin's avatar Scott Locklin said, on February 27, 2026 at 9:26 pm

      It simply won’t work on a very large codebase. The context window is only 200k tokens. You can point it at areas which are relatively isolated and ask it to fix that, or examine different pieces and write notes. It tries to clear irrelevant junk, but will often forget important things, like how to write a correct docstring for literate coding, or that you need to close parens in a lisp. It’s a very entertaining challenge to make it go on something beefy, but ultimately the monkey in charge has to make a lot of choices for it, otherwise it just produces insane things.

      Stuff like R packages though are a decent sweet spot of “not too much context window needed.” Architecture; fuhgeddaboudit.

  10. Norme-alitée's avatar Norme-alitée said, on March 9, 2026 at 9:06 am

    I did not have the chance yet, but have you tried this: https://github.com/karpathy/autoresearch? Besides the obvious seed hacking, if we scale it up with more agents, it seems like it could improve the main model by pure chance…

  11. toastedposts's avatar toastedposts said, on March 9, 2026 at 10:18 pm

    I’ve played with these things, but I’m wary of the (very consciously engineered with malice aforethought) dependency trap. OTOH, odds are eventually people will be able to run their own local models with enough capability (I’ve played with local LLMs for “coding assistance”.) OTOH, if I become dependent on a calculator, at least I own the calculator and it’s performing *my* instructions.

    Random aside: Peskin and Schroeder chapter 2 has an integral that's the transition amplitude for a particle to travel x distance in t time. They claimed that using a proper relativistic operator for momentum wouldn't help you, and that you'd still end up with nonzero amplitude for faster-than-light propagation. I balked at that because it looked very close to the situation with the classical Green's function, and the classical Green's function is for a hyperbolic PDE with a well-defined characteristic separating domains of dependence and influence from spacelike points. I did the Fourier transform numerically and found that, no, things are actually causal if you aren't sloppy with relativistic math. As |k| increases, your amplitude compresses to a delta-sphere. It may be a minor nitpick irrelevant to scattering cross-section calculations (since they all happen effectively at a "point", with plane-wave particle solutions propagating to "far away"). OTOH, I wonder how much of the astounding and mysterious behavior of QFT comes down to pathologies with contour integrals involving delta functions leading to math errors?

    • toastedposts's avatar toastedposts said, on March 9, 2026 at 10:23 pm

      The Fourier image of the classical Green's function for the Klein-Gordon eqn is something like 0.5*exp(i*sqrt(k^2+m^2)*t) + 0.5*exp(-i*sqrt(k^2+m^2)*t). Their image was just the +i part of that. In both cases, things only propagate at lightspeed or less.
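      The causality claim above can be sanity-checked numerically. The sketch below (my assumption: the closely related 1+1-dimensional Pauli-Jordan-type amplitude D(x,t) = ∫ dk cos(kx) sin(ωt)/ω with ω = sqrt(k²+m²), which combines both ±ω branches and converges nicely) integrates over a finite k grid and compares a point inside the light cone with a spacelike-separated one: the inside amplitude is O(1), the outside one is near zero up to cutoff ringing.

```python
import math

# Numerical sketch: the classical Klein-Gordon amplitude in 1+1 dimensions,
# D(x,t) = \int_0^K dk cos(k x) sin(w t)/w,  w = sqrt(k^2 + m^2),
# combining the +i and -i frequency branches. Causality predicts D ~ 0
# for spacelike separation |x| > t (up to finite-cutoff ringing).
def amplitude(x, t, m=1.0, kmax=200.0, n=200_000):
    dk = kmax / n
    total = 0.0
    for i in range(n):
        k = (i + 0.5) * dk                  # midpoint rule on [0, kmax]
        w = math.sqrt(k * k + m * m)
        total += math.cos(k * x) * math.sin(w * t) / w
    return total * dk

inside = amplitude(2.0, 5.0)    # timelike point, inside the light cone
outside = amplitude(10.0, 5.0)  # spacelike point, outside the light cone
```

      With these parameters the inside value is O(0.1–1) while the outside value is suppressed by a few orders of magnitude, consistent with propagation at lightspeed or less.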

      • Scott Locklin's avatar Scott Locklin said, on March 10, 2026 at 4:13 pm

        I’m not sure what conversation the physics piece is related to, but qwen3-coder-next generates reasonable code, even though the tool using abilities appear to be broken at the moment.

  12. pieter's avatar nicholas said, on March 11, 2026 at 4:38 am

    After using Claude Code for the first time I came to this blog to see if Locklin had written anything about it, and oh boy, he has!

    These tools give the average person unparalleled access to digital freedom like never before. I am a self-identified programming-illiterate individual, and I like to spend my brainpower researching and implementing farm technologies. Basically a member of the white-collar underclass. I strongly believe that people like me are going to be the major beneficiaries.

    I’ve started using Claude two weeks ago and have :

    Located a long-time missing family member using OSINT techniques
    Interfaced with major producers on soil nutrients in general commodity crops, based on publicly available data and legal drone flights

    Created a GitHub repo that now has over 30 stars, as somebody who did not know how to use basic bash commands two weeks ago
    Filled out funding documents to allow an undergraduate team to receive hundreds of thousands of dollars in funding
    Upgraded my university org's website, making it the best in the clunky enterprise, what-you-see-is-what-you-get builder they force you to use if you want an official domain, by automating Chrome and overriding the default theme with hundreds of lines of CSS !important

    • Scott Locklin's avatar Scott Locklin said, on March 11, 2026 at 11:05 am

      Be careful. It produces lots of silent bugs, poor architectures and security problems on larger projects.

  13. RetroGamer's avatar RetroGamer said, on March 23, 2026 at 3:40 pm

    I love how honest you are about your skepticism of coding assistants. I’ve been experimenting with them too and while they’re super helpful for generating code snippets, I think there’s still a place for human intuition in the creative process 🤔

    • Scott Locklin's avatar Scott Locklin said, on March 26, 2026 at 10:22 am

      I’m impressed with what they can do, but a lot of the time you should probably be doing something else.

