Coding assistant experience
I’m a modest LLM skeptic. It’s not that I don’t believe in LLMs, I am aware that they exist, I just know that they’re not doing what people do when we think, and that they’re not going to hockey stick up and replace everybody. If it helps people, they should use them: I do. ask.brave.com is my first stop for answering transient questions or software configuration issues. It produces useful results and cites its sources; a great search API. It also doesn’t remember what I asked it (Brave is privacy first), which is what you want most of the time. Grok gives OK answers too, but I don’t like the answers as much, and I have no idea what their privacy policies are. Qwen has been OK for answering coding questions and small code fragments.
I have a few jobs I’ve been putting off; fiddly and annoying translations from Python to R, updating APIs, etc. I also have a couple of challenge problems I have asked AI chatbots to gauge where we’re at for things which I care about. Qwen is by far the best free and open chatbot I’ve used, and it had gotten good enough that I decided to fork out for claude-code and take it for a spin. Also inspired by asciilifeform’s comments; dude’s grouchier and more skeptical than I am, so I took his statements on the utility of claude-code very seriously. People who use LLMs at work already can probably skip to the end, as you already know more than I do about using these things, though maybe some of the observations are of use.
Mostly the type of work I do is numeric, and numeric coding is significantly different from what most do. I never had any doubts that an LLM could do Javascript plumbing, or even back end plumbing code. Lots of examples of this to train on, along with complicated regular expressions, SQL queries and so on. I figured they’d eventually do something with numeric stuff, though it was less clear when it would happen for my favorite programming languages.
Some claude-code notes:
0) You need to pay for the $200/month one to get anything useful done with claude-code. This is annoying as it’s difficult to burn all your tokens, but the cheap plans run out almost immediately. Jerks. I should be able to pay as I go without talking to some salesdork or signing up for a subscription.
1) Claude code has access to your hard drive, and you have to invoke lucifer and kernel modules to keep it from ruining your life. Yah, in principle you can trust the thing. Back in the 90s you could in principle have an RPC daemon on your Sun workstation which executes arbitrary code, and most of the time nothing bad would happen. Anyone who trusts this thing with sensitive code is fucking retarded. You need to run local for this.
2) The first task: translation from the lost souls who think Python is an adequate mode of scientific communication to something less insane (in my case, R, though I still hold Matlab is the best tool for scientific communication). That’s something an LLM should be great at. Mostly the chatbots haven’t been, but recently they seem to have acquired the skill. This was my most pressing reason for trying claude-code, which I assumed would be better than a chatbot. Claude managed to achieve the task in maybe something like twice the time it would have taken me, in a fashion quite a bit more code complete than I would have done. Of course it forgot to add a predict method for a bunch of algorithms that people basically only use to predict things, but once I told it to do so, it did. The first go-round it reproduced every python class in the old repo and made them public, which is exactly what you’d expect from a machine that doesn’t understand anything: the actual algorithm is “fit model, predict model,” so you need exactly two public functions, with the other functions being called as options inside the create function. Once I yelled at it enough, hollered at it to update the manual pages to match what’s inside the functions and so on, it did a reasonable job. Another thing I find extremely painful in R: making a vignette and festooning the source with inline documentation using rmarkdown. I’ve always found this onerous, but the LLM doesn’t seem to mind. I prompted it to use a google style guide for R packages, so the style isn’t horrible. Beating it into shape was a fairly high attention process, though it was my first time using claude-code. All told I put much more time into it than I would have fooling around on my own. This is because it’s low effort work, where writing it yourself is high effort work. There’s a problem here: since it’s low effort to generate a lot of code, now you have a lot of code.
Code that has to get maintained if you’re actually using it.
3) Another major unpleasant task I have is turning a paper I read into code. For simple things, LLMs should be able to do this. For more complicated things, I assume there is a limit based on context windows. Indeed Claude code was able to turn this paper (my go-to challenge problem) into reasonable working R code; Bernoulli Naive Bayes with EM semisupervised updates. This is something I had done myself for a project, but never checked into any remote repo, so I knew there would be no cheating. I also looked fairly extensively for an example on github and didn’t find any (albeit some years ago now, but people are retarded and would rather fiddle with neural nets than this most excellent trick). Claude was considerably slower at this than the translation job, and produced what I consider fairly poor quality code, though I didn’t prompt it with any style guides. Still, actually doing the damn thing is pretty good, and I’ll be testing this type of “read the paper, geev mee code” job further with more difficult problems. For those of you not in the know, Bernoulli Naive Bayes is basically column means, and the EM algorithm is awfully simple: maybe around the complexity of Newton’s method. Someone like me can do it in an hour if you point a gun at me and give me an espresso enema, or a couple of hours if I’m taking my time and being careful. If I can get algorithms from papers on non-trivial problems, this is a nice application for me; I have an enormous backlog of interesting looking ideas with no public code associated with them. Understanding the papers in enough detail to write code is a pain in the ass, especially if you don’t have good building blocks.
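To give a sense of how small this algorithm really is, here is a minimal numpy sketch of semisupervised Bernoulli Naive Bayes with EM updates. This is my own toy reconstruction for illustration, not the paper’s reference code and not anything Claude produced; all the function and variable names are made up.

```python
import numpy as np

def fit_bnb_em(X_lab, y_lab, X_unl, n_iter=50, alpha=1.0):
    """Bernoulli Naive Bayes with EM over unlabeled rows.
    X_* are 0/1 matrices; y_lab holds integer class labels."""
    classes = np.unique(y_lab)
    K = len(classes)
    # hard responsibilities for labeled rows, uniform soft ones for unlabeled
    R_lab = (y_lab[:, None] == classes[None, :]).astype(float)
    X = np.vstack([X_lab, X_unl])
    R = np.vstack([R_lab, np.full((len(X_unl), K), 1.0 / K)])
    for _ in range(n_iter):
        # M-step: class priors and per-class feature means ("column means")
        Nk = R.sum(axis=0)
        prior = Nk / Nk.sum()
        theta = (R.T @ X + alpha) / (Nk[:, None] + 2 * alpha)  # smoothed
        # E-step: posterior class probabilities for the unlabeled rows only
        logp = (X_unl @ np.log(theta).T
                + (1 - X_unl) @ np.log(1 - theta).T
                + np.log(prior))
        logp -= logp.max(axis=1, keepdims=True)   # avoid underflow
        P = np.exp(logp)
        R[len(X_lab):] = P / P.sum(axis=1, keepdims=True)
    return classes, prior, theta

def predict_bnb(model, X):
    """Argmax of the per-class Bernoulli log likelihood plus log prior."""
    classes, prior, theta = model
    logp = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
            + np.log(prior))
    return classes[np.argmax(logp, axis=1)]
```

Note the public surface is exactly the fit/predict pair, per the gripe in item 2; everything else is internal.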
4) The final category of unpleasant “I will likely defer this job forever” task is gluing an API into R (or J, which I have ambitions of getting back to), then using that to implement an algorithm. I asked claude to fill out some of the missing functionality from mlpack. Looked OK, I didn’t test them. I also had it code up an API for mlpack for J, which it appeared to do (it’s been so long since I used J, testing it was painful; sorry about all the sub dependencies it put in the repo).
Tasks 2 and 3 are my most common use cases. Mostly it doesn’t matter if the results are slop. 4 is an occasional dreary task as well, though R has a decent ecosystem of people who have done this for everyone. Telling the thing how to do my daily tasks is probably also automatable to some extent, but it would mostly be a waste of time. Interactive work is interactive, and Captain Kirking it with an LLM agent is just going to piss me off. I don’t even like using R notebooks, so making an LLM R notebook is no good.
qwen3-coder-next:
I also ran qwen3-coder-next on my threadripper. It’s slow, but can be used if the threadripper isn’t chugging on any other serious tasks. The motivation isn’t to avoid the $200 a month subscription fees; it’s the fact that I don’t trust Claude with anything actually sensitive, like things which produce money for me. It was a pain in the ass to get this stood up and functioning. I did it like this:
numactl --interleave=all ./build/bin/llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_M \
  --numa distribute \
  --threads 32 \
  -c 262144 \
  --no-mmap \
  --jinja \
  --host 0.0.0.0 --port 8080
ollama basically doesn’t work. In this case, for the first round, I ended up using a python tool called aider to run it (claude-code-agent in emacs for the claude-code interactions). I think aider is a little clunky; it couldn’t figure out how to make a subdirectory from where I invoked it. Probably choking on context. Might be user error somehow; I went back to emacs (gptel-agent) later and fixed it. TPS appeared to be on the order of 20, with very slow prompt processing though. Claude is roughly twice this speed, though it feels faster because it’s running on someone else’s hardware and doesn’t choke as badly on context. I was able to reproduce the semisupervised Bernoulli Naive Bayes with EM updates example that claude-code did, as well as a simple Python translation example (a novel fast fitting method for logistic regression). Took about as long for the first round, and wasn’t as smooth an interaction. Fed it exactly the same prompt. Got the algorithm right in the first shot, but the NB R package was all borked up, which is the kind of thing I noticed in the qwen chatbot. This required a fairly long context window, so I’m a bit dubious about pointing qwen-code-agent at a more involved paper until I upgrade my hardware. I actually like the code qwen produces a little better. Not bad for 3 billion active parameters, thank you based Chinese frens. Oddly the python translation seemed to give it more trouble, again I think because of the slowness of parsing context windows on the threadripper.
There are a couple of reasonably cheap potential hardware solutions to run this qwen3 thing without heating up the threadripper or spending 10k on a big video card and a new power supply; Strix Halo from AMD and NVIDIA GB10 Grace Blackwell. Both are small boxes running Linux with 128G of shared memory and a medium-beefy GPU. Neither seems to have any huge performance advantages over the threadripper or each other (real world experiences welcome; supposedly NVIDIA is faster on context), but they’d allow me to do vibe coding while using the threadripper cores for other tasks. Nice airgap as well. If anyone owns such a shoebox machine and has had good experiences, feel free to pipe up. I ordered the AMD gizmo so I wouldn’t have to deal with maintaining a development environment for ARM chips. I’ll probably run the claude stuff from this machine as well for the airgap benefits.
While qwen3 did an OK job, it was no fun to work with. The slow context parsing speed of the thing makes the tooling even more clunky, though in emacs (gptel-agent) it was a better experience than aider. The agentic part of the mechanism and differences in how something like claude-code works (an NPM package) isn’t fully clear to me yet. “Thing that runs machine generated shell scripts” seems to be about the size of it. How the LLM knows when it’s hooked up to something with agency isn’t clear. I suppose I can ask an LLM for an explanation here.
random unconnected thoughts:
A fun and actually useful thing to try would be to get one of these things to make Lush 64 bit clean. If I could do that without bothering the authors, that would be amazing. Maybe I can burn up some Claude tokens on this when I’m not using it for other tasks.
The chatbot part: I don’t think Claude Opus 4.6 is anything special. Like all the other ones, it speaks authoritatively, talks in circles, contradicts itself and is generally full of shit. Makes a decent coding assistant though. Asking it for advice on buying a machine for running qwen3 locally, for example: actual search engines (including ask.brave.com) produce better results that don’t contradict each other every other line.
Fun thing I didn’t fully realize until performing this exercise: LLMs don’t have state. The agent keeps state by feeding the prompt (in most cases the entire prompt, including the entire codebase you’re working on, all the search results, etc, every time there is an update) back to the LLM, along with the most recent results. This is, of course, insane. It is particularly insane that people think this kind of Rube Goldberg contraption is sentient somehow. LSTMs are more sentient.
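The entire “memory” of one of these agent sessions is just this loop; sketched below with a stand-in for the model call (complete_fn is a placeholder I made up, not any real API):

```python
def run_agent(complete_fn, user_turns, system_prompt="You are a coding agent."):
    """Stateless-model chat loop: the model sees the ENTIRE transcript
    on every call; nothing persists inside the model between calls."""
    transcript = [("system", system_prompt)]
    for turn in user_turns:
        transcript.append(("user", turn))
        # every call re-sends everything accumulated so far
        reply = complete_fn(transcript)
        transcript.append(("assistant", reply))
    return transcript
```

Since every call re-processes the whole growing transcript, total tokens processed grow roughly quadratically in conversation length, which is why slow prompt processing on local hardware hurts so much.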
Complexity: R packages implementing an algorithm are a decent sweet spot for something like this. The R packaging system is designed to insulate the REPL from shitty coders who understand things about statistics. The context window is never going to be enormous, it’s generally going to be a couple hundred to a thousand lines of code that accomplishes a well defined numeric task.
Productivity thoughts:
One thing which is for certain: Claude code isn’t replacing anyone’s job. Anthropic’s headcount isn’t getting smaller. The good thing about using a tool like this is that it has low cognitive overhead; I have to figure out how to constrain a mildly retarded computard helper and make it do the things I actually care about. Once I’ve read the paper or glanced at the original source I have a fair idea of what I want the result to look like, and I have to break the task down to something a retard could understand. This is something I do for myself already (being retarded 👍), though the degree and quality of my personal retardation is considerably different. I also have to debug the result afterwards: there will be a lot of bugs, where writing code interactively is kind of online debugging. But, it is useful enough and does things I find onerous and unpleasant in a relatively painless manner, so I’m gonna use it. Sort of like an employee, yes: but a bad employee. One you can’t trust with anything important, and who takes longer at accomplishing tasks than doing them yourself. People who trust vibed code with important things, well, rotsa ruck to you.
There’s a hidden cost to this sort of thing. Because you can write a bunch of code without burning up your precious brain-sugars, you will write a bunch of code. Now you have a bunch of code of dubious utility. In my case, I’ve been very careful to not engage in writing code from papers or translating from python or whatever unless I was pretty sure there was paydirt. Now I’m gonna do it more often. While it feels non-tiring to do this sort of thing, it still takes a nontrivial amount of time, and an even more nontrivial amount of time to evaluate the algorithms the LLM made for me. Maybe I should be working on something else?
For a trivial example, I just spent a couple weeks fooling around with this nonsense. I have one machine generated R package of marginal utility to my actual project to show for my troubles, as well as a much better understanding of the abilities of LLM coding assistants. This is absolutely abysmal from a productivity point of view. Lines of code generated looks amazing, but I don’t get paid for lines of code. “Maybe it will pay off in future productivity,” but that sounds an awful lot like the sales bilge on the tin for vendors of these things. The real world results indicate otherwise. They’re even starting to notice the Solow paradox, aka the fact that ladies with a rolodex, telephone and filing cabinet are as economically efficient as putting everything online and in databases.
Consider my likely trajectory with this crap: I’ve already dumped $2200 into a Claude membership and a new piece of hardware to run qwen3-coder for me. I’ll have to configure and maintain that piece of hardware, burning more real world time, and the ongoing cost of claude if I continue the membership. I’ll also burn real world time coding up random ideas I would have ignored in the past, or only approached cautiously. Just like putting the internet on my computard, it will open up vast new avenues for wasting time, rather than keeping focused in my pursuit of actually economically productive goals. Is it a win or a loss? I can’t tell. Still gonna use it, but cautiously.
Lush: my favorite small programming language
I meant to write about this when I started my blog in 2009. Eventually Lush kind of faded out of my consciousness, as it was a lot easier to get work doing stuff in R or Matlab or whatever. The guy who was maintaining the code moved on to other things. The guys who wrote most of the code were getting famous because of the German Traffic Sign results. I moved on to other things. I had a thought bubble the other day that I’d try to compile it and see what happened. The binutils guys have been busy for the last decade and changed all manner of things: I couldn’t even find documentation of the old binutils the last version of Lush2 was linked against. Then I noticed, poking around the old sourceforge site, that Leon Bottou had done some recent check-ins fixing (more effectively than me) the same problems in the Lush1 branch. I stuck the subversion repo, with history, on github so you can marvel at it. I may try to revive a few of the demos I remember as being cool.
I call it a small language; compared to contemporary Python or R it is quite small, and had a small number of developers. The developers were basically Yann LeCun and Leon Bottou and some of their students (there are other names in the source like Yoshua Bengio). This tool is where they developed what became Deep Learning (lenet5); the first version of Torch was in here (as I recall it was more oriented to HMMs at the time). Since it’s a lisp, it’s easy to add macros and such to make it do your bidding and fit your needs. Unlike anything else I’ve ever used, Lush is a real ergonomic fit for the programmer. It has a self-documenting feature which is incredibly useful: sort of like what R does, it takes comments in code and makes them into documentation. Unlike R documentation there is a way of viewing it in a nice gui and linking it to other documentation. So you have a nice manual for the system and whatever you built in it, almost automagically. Remember “literate programming?” It was always a sort of aspiration: this is a real implementation of it, and it’s so easy to use, you’d have to be actively malicious or in a pretty big hurry not to do it. Here’s a screen I made for myself so I could remember how to use some code I built 15 years ago (it still works BTW). You can update it at the CLI, just like everything else in a Lisp.
As a Lisp, you have access to macros which allow you to do magic things that make Paul Graham happy. I am smooth brain: I only wrote a couple of them: I’ve written considerably more C macros than Lisp macros and plan on keeping it that way. The Lush authors also don’t use them very often; mostly in the compiler, which is how it should be. “A word to the wise: don’t get carried away with macros,” as Peter Norvig told us in PAIP. There is also a very useful set of GUI tooling. Not just the help gizmo; there’s a full fledged GUI (ogre). Imagine that; something to develop old fashioned graphical user interfaces without importing two gigabytes of Electron and javascript baloney. The helptool uses this; it is not an HTML browser. The documentation format looks a bit like markdown with a few quirks; I never had to look at a manual to write the stuff. Essentially it looks like the standard two sentence comments you put in to remind yourself what a complicated function does. The GUI thing is written in a nice object system; I assume it’s something like CLOS: whatever it is, there are no surprises and anyone who knows about namespaces and objects can use it. I found it particularly useful for its encapsulation of raw FFI pointers and other tooling which is best trapped in a namespace where it can’t hurt anything.
Since it is oriented around developing 80s-90s era cutting edge machine learning, one of the core types is the array. The arrays are real APL style arrays: rank 0 to rank 4, which is probably one rank higher than most sane people use (most people use rank 2, aka matrices). It looks like it had up to rank-7 at one point: I have no idea what you’d do with that. APLs such as J often have rank-whatever, so someone somewhere has probably done something with such structures. Lush2 had an interesting APL like sublanguage for operating on the arrays, which looked pretty handy, but which I never quite got into (most of my work was in Lush1).
All this is cool, but I suppose other small programming languages promise things like this. The really cool thing about it is the layers. You get a high level interpreted Lisp. You also have a compilable subset of Lisp; mostly oriented around numerics things, just as one would expect in a domain specific language one might develop early convolutional net/deep learning algorithms in. Even better than this, if you want to call some C, including calling libraries, you can enclose your C in a Lisp macro and compile it right into the interpreter. Most of the interesting and useful code in the world still sits behind a C API. With a tool like this: suddenly you have a useful interpreter where you can vacuum in all the DLLs you want, and they’ll be available at the command line.
Most interpreters have some FFI facility for doing this; none to my knowledge are this easy to use or powerfully agglomerative. The memory management happens for free, more or less. In, say, R’s repl, you can do something called dyn.load on libraries with R compatible types. If it’s more complex than that you might have to write significant wrapper code, and this is a hack: it might just leak memory all over the place. You have to work pretty hard to encapsulate C libraries in a proper R package, compiling against the R sources. J, same story; you can use the 15!:0 foreign to load a dll and wrap up J structures to send, with some tooling to deallocate or copy memory locations (very carefully). In Lush, you call the C functions directly, in C, on C’s terms (or C++). You can write a couple of lines of C wrapper, a couple of pages; whatever: it’s all a part of the Lush source. If you look at examples of well-wrapped dlls in R on CRAN, you’ll see they’re festooned with all manner of ugly R structure casts, mysterious R #defines and all kinds of badness and quasi-memory management; you’d have to read a 300 page manual to make sense of what’s going on. Having done this a few times, I’m exaggerating a tiny bit, but it is tedious and fiddly and takes a fair amount of work; a couple days if you’ve never done it before, versus a couple minutes. In Lush you just stick a dollar sign in front of variables you allocated in Lush in your C function calls, and after it’s been compiled into the interpreter (which happens if you “libload” the file), you call them, and variables appear where they’re supposed to. No memory leaks. Usually doesn’t take down the interpreter when something goes wrong, though of course if you send something weird to a raw pointer it will probably segfault and die. Here’s an image grab of a simple method for instantiating a KD-tree using LibANN (a bleeding edge nearest neighbor library of circa 2009):

First lines are the documentation; inside the defmethod we try to make a new kdtree; the stuff between #{ and }# is normal C++. You can see the $ in front of $out, this tells the Lush compiler to pull the result back into the interpreter. This method gets compiled and loaded and accessed like any other method in Lush. idx2 is a matrix type, the other stuff does what you think it does.
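For contrast, here is the ctypes flavor of the same idea in Python, calling cos from the C math library. It works, but every signature gets declared by hand, and anything involving allocation or callbacks gets hairy fast; a toy, obviously, not equivalent to the LibANN wrapper above.

```python
import ctypes
import ctypes.util

# load the C math library and declare cos()'s signature by hand --
# exactly the type bookkeeping Lush's inline-C approach lets you skip
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

result = libm.cos(0.0)  # a real C call; the double is marshalled back for you
```

Get a type declaration wrong here and you silently get garbage back, which is the "quasi-memory management" class of fun in miniature.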
Lush dates from 1987: I don’t even remember what kind of computers people used back then. I assume something like a 68020 Sun Workstation or a VAX. Even when I was using it in 2009, a “multicore” system might have two cores, so it wasn’t really designed with that sort of thing in mind either (though you could link to blas, which does this in most numerics cases, and it has tooling to use it on a cluster). Some of the intestines of the thing probably reflect this. I’m pretty sure Lush1 is not completely 64 bit clean: when I was using it in 2009 it was 32 bit binaries only, which was fine as nobody had 256g of ram back then. Other stuff which will seem unfamiliar to contemporary people: it’s for talking to local libraries. There is no provision for a package manager over the interbutts, or much other network stuff I noticed beyond sockets. No JSON (didn’t exist; s-expr are better anyway), no SQL interfaces (exotic pay-for technology back then) and none of the stuff modern code sloppers are used to having. It was mostly a tool for developing more Lush code which links to locally installed libraries: this is what R&D on machine learning algorithms had to be back then. As a tool for building your own little universe of new numerics oriented algorithms it is almost incomparably cozy and nice. You get the high level stuff to move bits around in style. You get the typedefed sublanguage to compile hot chunks to the metal, and you get the C/C++ API, with adding new functions written in C/C++ a natural part of the system. Extremely cozy system to use. While it’s not the Lisp machine enthusiasts like Stas are always telling us about, it’s probably about as close as you’re going to get to that experience using a contemporary operating system and hardware. Yes you have to deal with the C API: I’m sorry about that, but it’s just current year reality. Nobody is going to rewrite BLAS in Haskell or CMU-CL to make you happy. Purity is folly.
As a tool, if I had to fault it for anything, it’s a few small things which I could probably fix. For example, in Kubuntu anyway, you can’t copy/paste examples from the helptool. This is probably something that could be repaired if I dig down into whatever X library the ogre package calls to do this. It’s no big deal; not a very wordy language anyway, and I should be reading the docs and typing code I’m about to run in emacs rather than copypasta. Another slightly annoying thing is a lack of built in pretty-print for results. Many languages have this problem: in Lush it’s easy to write one and I have one around somewhere. Some of the packages aren’t well documented and some don’t work because of various forms of bitrot: this is to be expected in something this old. Other than that, no faults. Very cozy programming language. The coziest.

The C insides are fairly understandable, modulo the glowing crystal dldbfd.c gizmo at the center that does the binutils incantations that make the dynamic linking magic happen. Even that looks like it could be understood if you were familiar with binutils. In Lush1 there are a number of odd pieces that were planned to be sawed off, which you can sort of infer by their absence in Lush2, which had a redesigned VM. However, Lush1 compiles and runs the old code, and Lush2 doesn’t.
While this programming language could (and really should) be revived, even in its present state it can be marveled at. Both for its historical importance in developing machine learning algorithms, and for its wonderful “programmer first” utility. I don’t know what exigencies caused them to move the Torch neural net library to Lua; probably whiny wimps who were intimidated by parentheses. I can guess why it ended up in Python (the Visual Basic of current year). It’s one of those things where, had things worked out a little differently, machine learning people would be typing lots of parentheses in vastly more futuristic Lush instead of drearily plodding along with spaghetti in Jupyter. It represents a very clear vision of how software development should work. No bureaucracies or committees were involved in its design: just people who needed a good tool to invent the future. I suspect the committees and social pressures involved in larger programming languages are why they’re often so awful. Lush is all designed and built by makers, not bureaucrats and “product managers.” It feels purposeful. It also feels incomplete, which is as it should be, as these guys were too talented to maintain programming languages. Like an unfinished da Vinci painting; you can see the grandeur of the artist’s vision.
I’ve always been a fan of these guys; as I pointed out in my article on DjVu, there is much to admire beyond their good taste in algorithms and dogged determination to continue working on them at a time when only eccentrics were interested in neural nets. All the cool kids of the era were doing SVMs …. because …. researchers are mostly trend following rather than thinking. Hopefully I don’t cheese them off too much by bringing it up, though as an American it is arguably a sovereign duty to piss off the French. For myself, I have a shitload of work to do in coming months. I sort of hope I can find an excuse to fiddle around with it some more, or maybe even use it in production in some small way. If I do, I’ll write about it. I encourage others to give it a try and ponder how cool 2024 would have been if we used this tool instead of the trashfire Python slop you’re all doomed to use in your day job.
More suspect machine learning techniques
Only a few weeks after “Various Marketing Hysterias in Machine Learning,” someone took a big swing at SHAP. Looks like the baseball bat connected, and this widely touted technique is headed for Magic-8 ball land. A few weeks later: another baseball bat to SHAP. I have no doubts that many people will continue using this tool and other such flawed tools. People still write papers about Facebook-Prophet and it’s just a canned GAM model. Even I am a little spooked at how this went: I was going to fiddle with SHAP, but the python-vm contraptions required to make it go in a more civilized statistical environment were too much for me, so I simply made dissatisfied noises at its strident and confused advent (indicative of some kind of baloney), and called it a day. Amusingly my old favorite xgboost now has some kind of SHAP addon in its R package. Mind boggling, as xgboost comes with feature importance measures which tell you exactly which features matter by using the goddamned algorithm in the package!
This little SHAP escapade reminds me of a big one I forgot: t-SNE. This is one I thought should be cool because it’s all metric-spacey, but I could never get to work. I should have taken a hint from the name: t-distributed stochastic neighbor embedding. Later a colleague at Ayasdi (names withheld to protect the innocent) ran some tests on our implementation and effectively proved its uselessness: it’s just a lame random number generator. This turkey was developed in part by neural eminence grise Geoff Hinton (you know, the guy making noise about how autocomplete is going to achieve sentience and kill us all). I think this is why it initially got attention; and it’s not a bad heuristic to look at a new technique when it is touted by talented people. Blind trust in the thing for years though, not so good. At this point there is a veritable cottage industry in writing papers making fun of t-SNE (and its more recent derivative UMAP). There are also passionate defenses of the thing, as far as I can tell, because the results, though basically random, look cool and impress customers. There have always been dimensionality reduction gizmos with visualization like this: Sammon mapping, Multidimensional Scaling (MDS), PacMAP, Kohonen maps, autoencoders, GTMs (PGM version of Kohonen maps), Elastic maps, LDA, Kernel PCA, LLE, MVU, things like IVIS, various kinds of non negative matrix factorization, but also …. PCA. Really you should probably just use PCA or k-means and stop being an algorithm hipster. If you want to rank order them: start with the old ones, and be suspicious of anything which dates after ~2005 or so, when the interbutts became how people learn about things: aka through marketing hysterias. I’ve used a number of these things and in real world problems …. I found Kohonen maps to be of marginal hand wavey utility: the t-SNE of its day I guess, almost totally forgotten now; also Kernel PCA, LLE, MDS.
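The PCA I keep recommending over the hipster gizmos is a few lines of numpy via the SVD; a sketch, not a library-grade implementation, and the names are mine:

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                        # coordinates in the new basis
    explained = s[:k] ** 2 / (s ** 2).sum()       # fraction of variance per PC
    return scores, Vt[:k], explained
```

Unlike the stochastic gizmos, this is deterministic, has a closed-form answer, and the explained-variance numbers tell you honestly how much structure your 2-d picture actually captures.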
I strongly suspect Sammon mapping and MDS are basically the same, and that LDA (Fisher linear discriminants, though Latent Dirichlet seems to work too, it’s out of scope for this one) is probably a better use of my time to fiddle with.
I suspect t-SNE gets the attention it does because it looks cool, not because it gives good answers. Rather than being relentlessly marketed, it sold itself: it easily produces cool looking sciency plots (that are meaningless) you can show to customers so you look busy.
Data science, like anything with the word “science” in the name, isn’t scientific, even though it wears the skinsuit of science and has impressive sounding neologisms. It’s sort of pre-scientific, like cooking, except half the techniques and recipes for making things are baloney that only work by accident when they do work.

Hilarious autism which I almost agree with
Some older techniques from the first or second generation of “AI” are illustrative as well. Most nerds have read Godel Escher Bach and most will go into transports about it. It’s a fun book, exposing the reader to a lot of interesting ideas about language and mathematical formalisms. Really though, it’s a sort of review of some of the ideas current in “AI” research in Hofstadter’s day (Norvig’s PAIP is a more technical review of what actually was being mooted). The idea of “AI” in those days was that one could build fancy parsers and interpreters which would eventually somehow become intelligent; in particular, people were very hot on an idea called Augmented Transition Networks (ATNs), which he gabbles on about endlessly. As I recall, the ATN approach fails on inflected languages, meaning that if ATNs were the road to sentience, Russians, Ancient Greeks and Latin-speaking Romans would not be sentient, which doesn’t seem right to me, Julian Jaynes notwithstanding. The idea seems absurd now, and unless you’re using Lisp or its json relatives (json is a sort of s-expression substitute: thanks Brendan), building a parser is hard and fiddly, so most people never think to do it.
Some interesting things came of it; if you use the one true editor, M-x doctor will summon one of these things for you. Emacs doctor/ELIZA is apparently a fair representation of a Rogerian psychologist: people liked talking to it. It’s only a few lines of code; if you read Winston and Horn (or Paul Graham’s fanfic of W&H) or Norvig, you can write your own. People laugh at it now for some reason, but it was taken very seriously back in the day, and it still beats ChatGPT on classic Turing Tests.
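It really is only a few lines; a toy sketch of the pattern-matching idea (the patterns and responses below are invented for illustration, and are nothing like Weizenbaum’s actual DOCTOR script, which also does pronoun swapping):

```python
import random
import re

# A toy ELIZA: ranked (pattern, responses) rules, in the spirit of
# Winston & Horn / Norvig. Rules here are invented for illustration.
RULES = [
    (r"\bI need (.*)", ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (r"\bI am (.*)",   ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (r"\bmy (\w+)",    ["Tell me more about your {0}.", "Why does your {0} concern you?"]),
]
DEFAULT = ["Please go on.", "I see.", "How does that make you feel?"]

def respond(text):
    # fire the first rule whose pattern matches, substituting captured text
    for pattern, responses in RULES:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            return random.choice(responses).format(*m.groups())
    return random.choice(DEFAULT)   # content-free Rogerian fallback

print(respond("I am unhappy with my job"))
```

That’s essentially the whole trick: the illusion of understanding from regex matching plus a stock of content-free reflections.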
Back then it was mooted that this sort of approach could be used to solve problems in general: the “general problem solver” was an early attempt (well documented in PAIP). There are ancient projects such as Cyc or Soar which still use this approach; expert system shells (ESS, not to be confused with the statistical module for the one true editor), more or less. This is something I fooled around with on a project to give me an excuse for fiddling in Lisp. My conclusion was that an internal wiki was much more useful and easier to maintain than an ESS. These sorts of fancy parsers do have some utility; I understand they’re used to attempt to make sense of things like health insurance terms of service (health insurance companies can’t understand their own terms of service apparently: maybe they should make a wiki), mathematical proof systems, and most famously, these approaches led to technology like Maple, Maxima, Axiom and Mathematica. Amusingly the Common Lisp versions of the computer algebra ESS idea (Axiom and Maxima) kind of faded out, though Maple and Mathematica both have a sort of ad hoc Lisp engine inside of them, proving Greenspun’s tenth law, which is particularly apt for computer algebra systems.
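The core of an expert system shell is similarly small; a toy forward-chaining inference engine, with rules invented for illustration:

```python
# A toy forward-chaining inference engine: the skeleton of an expert
# system shell. Facts and rules below are invented for illustration.
def forward_chain(facts, rules):
    """Fire rules until no new facts can be derived (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # a rule fires when all its premises are known facts
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [
    (["has_fur", "says_woof"], "is_dog"),
    (["is_dog"], "is_mammal"),
]
derived = forward_chain(["has_fur", "says_woof"], rules)
# derived now includes the chained conclusions "is_dog" and "is_mammal"
```

A real ESS adds variables, unification, and conflict resolution on top of this loop, but the loop is the whole idea; you can see why a wiki often wins on maintenance.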
Other languages were developed as a sort of iteration on the idea; most famously Prolog. All of these ideas were trotted out with the Fifth Generation Computing project back in the 80s, the last time people thought the AI apocalypse was upon us. As previously mentioned, people didn’t immediately notice that it’s trivial to make an NP-hard query in Prolog, so that idea kind of died when people did realize this. I dunno, constraint solvers are pretty neat; it’s too bad there wasn’t a way to constrain them to not make NP-hard queries. Maybe ChatGPT or Google’s retarded thing will help.
yes, let’s ask the latest LLM for how to make Prolog not produce NP-hard constraint solvers
The hype chuckwagon is nothing new. People crave novelty and want to use the latest thing, as if we’re still in a time of great progress, such as when people were doing things like harnessing electricity and inventing quantum mechanics, airplanes and diesel engines. Those were real leaps forward, and the type of personality attracted to novel things got big successes using the “try all the new things” strategy. Nowadays we have little progress, but we have giant marketing departments putting false things into people’s brains. Nerds seem to have very little in the way of critical faculties to deal with this kind of crap. For myself, I’ve mostly ignored toys like LLMs and concentrated on… linear regression and counting things. Such humble and non-trendy ideas work remarkably well. If you want to get fancy: regularization is pretty useful and criminally underrated.
Yeah, OK, we have genuinely useful stuff like boosting now, also conformal prediction, both of which I think are genuine breakthroughs in ways that LLMs are not. LLMs are like those fiber optic lamps they used to sell to pot heads in the 70s at Spencer’s Gifts: made of interesting materials which would eventually be of towering importance for data transmission, but ultimately pretty silly. Most would-be machine learning engineers should probably stick with linear regression for a few years, then the basic machine learning stuff: xgboost, k-means. Don’t get fancy; you will regret it. Definitely don’t waste your career on things you learned about from someone’s human informational centipede. Don’t give me any crap about “how can all those smart people be wrong”: they were wrong about nanotech, fusion, dork matter, autonomous vehicles, string theory and all the other generations of “AI” that didn’t work as well. Active workers in machine learning can’t even get obvious stuff like SHAP and t-SNE (and before these, prophet and SAX and GA and fuzzy logic and case based reasoning) right. Why should you believe modern snake oil merchants on anything?
Current year workers who are fascinated by novelty aren’t going to take it to the bank: you’re best served in current year by being skeptical and understanding the basics. The Renaissance came about not because those great men were novelty seekers: they were men of taste who appreciated the work of the ancients and expanded on them. So it will be in machine learning and statistics. More Marsilio Ficino, less Giordano Bruno.
The Birthday paradox as first lecture
The birthday paradox is one of those things that should be taught in grade school to banish superstition, bad statistics and mountebanks. Of course there are people who understand the birthday paradox and still consult astrologers, but knowledge of this fundamental idea of probability theory at least gives people a fighting chance. It’s dirt simple; if you ask people what the probability of there being a shared birthday in a group of n people is, they’ll probably estimate much too low.
The probability of n people not having the same birthday is a lot like calculating the probabilities of hands of cards. You end up with an equation like the following:

$$P(\text{no match}) = \frac{d!}{d^n \,(d-n)!} = \prod_{k=0}^{n-1}\left(1 - \frac{k}{d}\right)$$

n is the number of people in the room, d is the number of days in the year. Note that this generalizes to any random quality possibly shared by people. The probability of a group of people sharing a birthday is:

$$P(\text{match}) = 1 - \frac{d!}{d^n \,(d-n)!}$$

You can try calculating this with your calculator, but 365! is a big number, and we can use a little calculus to make an approximation:

$$P(\text{match}) \approx 1 - e^{-n(n-1)/2d}$$
From here, anybody should be able to plug 365 in for d, set the probability to 50%, and get a solution of around 23 people; a counterintuitive solution. Probability with replacement works like that; coincidences like this are much more likely than our naive intuition implies. The naive intuition is that you need ~180 people in a room to have a 50% chance of a shared birthday. I guess most people are narcissists and forget other people can have matching birthdays too. Probability theory is filled with counterintuitive stuff like this; even skilled card players are often terrible at calculating real world odds involving such coincidences. Aka, if it’s a joint probability, you’re probably calculating it wrong. If it’s something in the real world, it is de facto a joint probability, even if you don’t think of it that way.
For the mathematically illiterate, $n! = n(n-1)(n-2)\cdots 2 \cdot 1$. Thinking about what this (the factorial) is doing: the first guy has to compare himself to n-1 people, the second to n-2, the third to n-3, and so on. Another way to think about it: there’s a 1/365 chance of you sharing a birthday with anybody, so a 364/365 chance of not sharing. You should be able to hand-wave your way around that. You can take my word for the calculus approximation (google Stirling’s formula if you want to know more). If you stop to think about it a bit, the paradox is worse if you have an uneven distribution of probabilities for birth dates (aka more people born in October or something), and of course it is much worse if there are fewer possibilities (most things in life have fewer than 365 possibilities). Really they’re the same thing: uneven distributions are like removing possibilities.
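If you don’t trust the hand waving, the whole thing is a five line computation; the exact product alongside the calculus approximation:

```python
import math

def p_shared(n, d=365):
    """Exact probability that n people share at least one birthday
    among d equally likely days: 1 - prod_{k=0}^{n-1} (1 - k/d)."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (d - k) / d
    return 1.0 - p_distinct

def p_shared_approx(n, d=365):
    """The calculus shortcut: 1 - exp(-n(n-1)/2d)."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))

# smallest group size with a >= 50% chance of a shared birthday
n = next(n for n in range(1, 366) if p_shared(n) >= 0.5)
print(n, round(p_shared(n), 4))  # 23 0.5073
```

With 23 people the exact answer is about 50.7%, and the exponential approximation lands within a percent of it.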
The human mind is designed to match patterns; to ascribe meaning to seeming regularity. “Coincidences” such as the birthday paradox are like crack cocaine to the brain’s pattern matcher. The ancients had their books of portents, actions of birds and animals, liver condition of sacrifices to the Gods. The reality is, the bird livers had a very limited number of states; vastly more limited in distribution than 1 in 365. So of course you’ll see a lot of seemingly convincing coincidences. You’ll also forget about it when liver-divination doesn’t work just as you would with astrology or tarot cards. The ancients weren’t stupid, even if they never invented probability theory, and supernatural explanations seemed natural enough at the time, so all of this was convincing.
Very intelligent people, even scientists, are just as subject to this sort of thing as anyone else. There are a couple of books out there about the correspondence of Wolfgang Pauli and Carl Jung about what they called “synchronicity.” This is a two dollar word for noticing coincidences and ascribing meaning to them. Mind you, Pauli invented large parts of quantum mechanics and was one of the most intelligent and famously bloody minded men of his time (Jung was more of a nutty artist type), yet he still fell for what amounts to a version of the birthday paradox, combined with an overactive imagination. Pauli was considered the conscience of physics; less charitably, he was called the “wrath of God;” he’d regularly chimp out at physics which was even mildly off. You can sort of understand where he was coming from: physics represented stability to him in a crazy time of Nazis and Communists. He even had to deal with his mother killing herself, something he coped with by taking up with a chorus girl and recreational alcoholism: Pauli was the most punk rock of the early quantum mechanics. He made up some vague hand wavey bullshit about quantum entanglement, which is possibly also a mystical bullshit concept in itself, because many of the early quantum mechanics found themselves in similar circumstances. I know I’m more prey to mystical bullshit when hungover or otherwise in a psychologically fragile state. Mind you, this is a guy who would chimp out at other physicists for leaving a symbol out of an equation.

This bullshit got me lurid romantic encounters with countless goth girls and strippers while in grad school: thanks science bros
There is a technique used by confidence tricksters and stage magicians practicing mentalism related to this: in a group of people, getting an impressive cold read on one of them is pretty trivial. Fortune tellers, astrologers, occultists and quasi-religious entrepreneurs of all kinds use it. You use likely coincidences to build rapport until the mark is cooperating with you in the psychic game, basically giving you the answers with body language and carefully constructed questions. People have no conception of how probabilities work, so they practically hypnotize themselves when one of the mentalist’s “I see a woman in your life, an older woman…” patter lines strikes home. There are plenty of numskulls who believe in such nonsense without overt mountebanks misleading them: it turns out people who demonstrably suck at probabilistic reasoning are likely to believe all kinds of stupid nonsense.
If you work in statistics or machine learning, this sort of thing is overfitting. All statistical models and machine learning algorithms are subject to it. For a concrete example, imagine you made multiple hypothesis tests on a piece of data (machine learning is essentially this, but let’s stick with the example). The p-value is the probability of seeing a result at least as extreme by accident, assuming nothing is going on, for a single hypothesis test. You see where I’m going here, right? If you do many hypothesis tests, just like when you do many comparisons between all the people in the room, the individual p-values no longer mean what you think they mean. You are underestimating the false discovery rate, just as you are underestimating the group birthday probability: the combinatorics makes coincidences happen more often.
The very existence of this problem escaped statisticians for almost a century. I think this happened because statistical calculations were so difficult with Marchant calculators and human computers when Fisher and people like him were inventing statistics as a distinct branch of mathematics that they’d usually only do one estimate, which is what the p-value was good for. Later on, when computers became commonplace, statisticians were so busy doing boatloads of questionable statistics in service of the “managerial elite,” they forgot to notice p-values are underestimated when you’re doing boatloads of questionable statistics. Which is one of the reasons why we have things like the reproducibility crisis and a pharmacopoeia that doesn’t confer any benefit on anyone but shareholders. Now at least we are aware of the problem and have various lousy ad-hoc ways of dealing with it: the Bonferroni correction (basically you multiply p-values by the number of tests; not always possible to count, and not great, but better than nothing), and the Benjamini-Hochberg procedure (a fancier relative which controls the false discovery rate instead). There are other ideas for fixing this, q-values and e-values most prominent among them; most haven’t really escaped the laboratory yet, and none have made it into mainstream research in ways which push the needle, even assuming they got it right. The important takeaway here is that very smart people, including those whose job it is to deal with problems like this, don’t understand the group of ideas around the birthday paradox.
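The effect is easy to see in simulation: run a pile of hypothesis tests on pure noise (under the null hypothesis, p-values are uniform on [0, 1] by definition) and count the “discoveries” with and without a Bonferroni threshold:

```python
import random

# Simulate a study that runs many hypothesis tests where nothing is
# going on. Under the null, each p-value is uniform on [0, 1].
random.seed(42)
m = 1000                                       # number of tests, all null
pvals = [random.random() for _ in range(m)]

naive = sum(p < 0.05 for p in pvals)           # expect ~50 false "discoveries"
bonferroni = sum(p < 0.05 / m for p in pvals)  # expect ~0.05, i.e. almost none

print(naive, bonferroni)  # something near 50, versus almost certainly 0
```

The naive analyst “discovers” dozens of effects in pure noise; the corrected threshold finds essentially nothing, which is the right answer here. The cost of Bonferroni is brutal loss of power when real effects exist, which is what Benjamini-Hochberg and friends try to soften.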
People in the sciences have called for the publications of negative results. The idea here is if we knew all the different things people looked at with negative results, we could weight the positive results with something like Bonferroni corrections (also that people who pissed up a rope with a null experiment get credit for it). Of course my parenthetical “it’s not always possible to count” thing comes into play here: imagine everyone who ever ran a psychology experiment or observational study published null results: which ones do you count as relevant towards the one you’re calculating p-values for? What if 10,000 other people ran the experiment, got a null and forgot to mention it? Yep, you’re fucked as far as counting goes. Worse than all this, of course, is the nature of modern academia is such that fraud is actively encouraged: as I have said, I used to listen to people from the UCB Psychology department plotting p-mining fraud in writing their ridiculous papers on why you’re racist or why cutting your son’s balls off is good for him or whatever.
Trading algorithms are the most obvious business case where this comes into play, and there are tools to deal with the problem. One of the most famous is White’s Reality Check, which uses a sort of bootstrap algorithm to test whether your seemingly successful in-sample trading algorithm could be attributed to random chance. There are various other versions of this: Hansen’s SPA, Monte-Carlo approaches. None are completely satisfying, for precisely the same reason writing down all the negative science experiments isn’t quite possible. If you brute forced a technical trading algorithm, what about all the filters you didn’t try? What do you do if you used Tabu search or a machine learning algorithm? Combinatorics will mess you up every time with this effect if you let it. White’s reality check wasn’t written down until 2000; various systematic trading strategies had been around at least 100 years before, and are explicitly subject to this problem. It’s definitely a non-obvious problem if it took trader types that long to figure out some kind of solution, but it is also definitely a problem.
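A toy sketch of the flavor of the idea, not White’s actual stationary-bootstrap machinery: generate returns with no real edge, pick the best of many random “strategies,” then rerun the whole best-of search on resampled returns to see how much “alpha” shows up by pure chance:

```python
import random

random.seed(0)

# Toy data-snooping setup: market returns with zero edge, and a pile of
# "strategies" that are just random long/short positions each day.
T, n_strats, B = 500, 20, 100
market = [random.gauss(0.0, 0.01) for _ in range(T)]
strategies = [[random.choice([-1, 1]) for _ in range(T)] for _ in range(n_strats)]

def mean_pnl(signs, rets):
    """Average daily P&L of a fixed long/short position sequence."""
    return sum(s * r for s, r in zip(signs, rets)) / len(rets)

best = max(mean_pnl(s, market) for s in strategies)  # in-sample "alpha"

# Reality-check flavor: rerun the entire best-of search on bootstrap
# resamples of the returns, and see how often chance beats our "winner".
count = 0
for _ in range(B):
    boot = [random.choice(market) for _ in range(T)]
    if max(mean_pnl(s, boot) for s in strategies) >= best:
        count += 1
p_value = count / B  # a large p-value says: the "edge" is consistent with luck
```

The crucial point is that the null distribution is of the *maximum* over all strategies searched, not of any single strategy; that is exactly the birthday-paradox correction, and exactly what naive backtesting omits.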

The six degrees of Kevin Bacon effect is the same thing, though “network science” numskulls make a lot of noise about graph topology: it doesn’t really matter as long as the graph is somewhat connected (yes, I can prove this). Birthday paradox attacks on cryptographic protocols are also common. Probably the concept is best known today because of birthday attacks on hashing functions.
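The hash version is easy to demo: truncate a real hash to 24 bits (my choice, purely for illustration) and a collision shows up after roughly 2^12 attempts, not the 2^24 you’d naively expect:

```python
import hashlib

def h24(msg: bytes) -> bytes:
    """A deliberately weakened hash: sha256 truncated to 24 bits."""
    return hashlib.sha256(msg).digest()[:3]

# Birthday attack: hash distinct messages, remembering each digest, until
# two messages collide. With 2^24 possible digests, a collision is expected
# after only about sqrt(pi/2 * 2^24) ~ 5000 tries: the birthday paradox.
seen = {}
i = 0
while True:
    msg = str(i).encode()
    d = h24(msg)
    if d in seen:
        collision = (seen[d], msg)  # two distinct messages, same digest
        break
    seen[d] = msg
    i += 1

print(f"collision after {i + 1} hashes")  # thousands, nowhere near 2^24
```

This square-root speedup is why a hash needs a 256-bit output to offer 128 bits of collision resistance.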
It seems like humans should have evolved better probability estimators which aren’t confused by the birthday paradox. People who estimate probabilities accurately obviously have advantages over those who don’t. Someone wrote a well regarded book (which ironically flunked reproducibility) on this: Thinking, Fast and Slow. The problem is that Kahneman’s “System 2” probability estimator (the one associated with conscious thought) is generally just as bad as the “System 1” (instinctive) estimator it derives from. The brain is an overgrown motion control system, so there is no reason the System 2 probability estimator is going to be any good, even with a lot of self reflection. System 2 thinking, after all, is what got us Roman guys looking at livers to predict the future (or Kahneman’s irreproducible results). System 2 is just overgrown System 1, and the System 1 gizmo in your noggin is extremely good at keeping human beings alive using its motion control function, so it’s difficult to overcome its biases. You don’t need to know about the birthday paradox to avoid being eaten by lions or falling off a cliff. But you definitely need to know about it for more complicated pattern matching.

