Locklin on science

More suspect machine learning techniques

Posted in five minute university, machine learning by Scott Locklin on March 14, 2024

Only a few weeks after “Various Marketing Hysterias in Machine Learning,” someone took a big swing at SHAP. Looks like the baseball bat connected, and this widely touted technique is headed for Magic-8-ball land. A few weeks later: another baseball bat to SHAP. I have no doubt that many people will continue using this and other such flawed tools. People still write papers about Facebook Prophet and it’s just a canned GAM model. Even I am a little spooked at how this went: I was going to fiddle with SHAP, but the python-vm contraptions required to make it go in a more civilized statistical environment were too much for me, so I simply made dissatisfied noises at its strident and confused advent (indicative of some kind of baloney) and called it a day. Amusingly, my old favorite xgboost now has some kind of SHAP addon in its R package. Mind-boggling, as xgboost comes with built-in feature importance which tells you exactly which features matter by using the goddamned algorithm in the package!
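For the record, getting feature importance out of a boosted tree model takes one attribute lookup; no SHAP required. A minimal sketch using sklearn’s gradient booster as a stand-in (xgboost’s API is analogous), on synthetic data where only the first feature matters:

```python
# Feature importance straight out of the fitted trees. Only x0 drives y,
# and the importance numbers say so.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)  # x1, x2 are pure noise

model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["x0", "x1", "x2"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances sum to one and come from how much each feature actually reduced loss in the fitted trees: the algorithm telling you what it used, not a post-hoc story about it.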

This little SHAP escapade reminds me of a big one I forgot: t-SNE. This is one I thought should be cool because it’s all metric-spacey, but I could never get it to work. I should have taken a hint from the name: t-distributed stochastic neighbor embedding. Later a colleague at Ayasdi (names withheld to protect the innocent) ran some tests on our implementation and effectively proved its uselessness: it’s just a lame random number generator. This turkey was developed in part by neural éminence grise Geoff Hinton -you know, the guy making noise about how autocomplete is going to achieve sentience and kill us all. I think this is why it initially got attention; and it’s not a bad heuristic to look at a new technique when it is touted by talented people. Blind trust in the thing for years, though: not so good. At this point there is a veritable cottage industry in writing papers making fun of t-SNE (and its more recent derivative UMAP). There are also passionate defenses of the thing, as far as I can tell, because the results, though basically random, look cool and impress customers. There have always been dimensionality reduction gizmos with visualization like this: Sammon mapping, Multidimensional Scaling (MDS), PaCMAP, Kohonen maps, autoencoders, GTMs (the PGM version of Kohonen maps), Elastic maps, LDA, Kernel PCA, LLE, MVU, things like IVIS, various kinds of non-negative matrix factorization, but also …. PCA. Really you should probably just use PCA or k-means and stop being an algorithm hipster. If you want to rank order them: start with the old ones, and be suspicious of anything dating after ~2005 or so, when the interbutts became how people learn about things: aka through marketing hysterias. I’ve used a number of these things on real world problems …. I found Kohonen maps to be of marginal hand-wavey utility: the t-SNE of its day I guess -almost totally forgotten now; also Kernel PCA, LLE, MDS.
I strongly suspect Sammon mapping and MDS are basically the same thing, and that LDA (Fisher linear discriminants; Latent Dirichlet seems to work too, but it’s out of scope here) is probably a better use of my time to fiddle with.
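PCA, for reference, is about five lines of numpy; no marketing budget required. A sketch on synthetic data that actually lives near a 2-dimensional subspace:

```python
# PCA in a few lines of numpy: center, SVD, project. No hysteria required.
import numpy as np

rng = np.random.default_rng(0)
# 200 points living near a 2-d plane embedded in 10 dimensions
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 10))
X = latent @ basis + 0.01 * rng.normal(size=(200, 10))

Xc = X - X.mean(axis=0)                 # center
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T               # coordinates on the top 2 components
explained = (s**2 / np.sum(s**2))[:2].sum()
print(f"variance explained by 2 components: {explained:.4f}")
```

If the variance-explained number is high, your data had low-dimensional structure and you’re done; if it isn’t, a fancier embedding is probably drawing you pretty pictures of noise.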


I suspect t-SNE gets the attention it does because it looks cool, not because it gives good answers. It was never relentlessly marketed; it sold itself, because it easily produces cool-looking sciency plots (that are meaningless) you can show to customers so you look busy.

Data science, like anything with the word “science” in the name, isn’t scientific, even though it wears the skinsuit of science and has impressive sounding neologisms. It’s sort of pre-scientific, like cooking, except half the techniques and recipes for making things are baloney that only work by accident when they do work.

Hilarious autism which I almost agree with

Some older techniques from the first or second generation of “AI” are illustrative as well. Most nerds have read Gödel, Escher, Bach and most will go into transports about it. It’s a fun book, exposing the reader to a lot of interesting ideas about language and mathematical formalisms. Really though, it’s a sort of review of the ideas current in “AI” research in Hofstadter’s day (Norvig’s PAIP is a more technical review of what actually was being mooted). The idea of “AI” in those days was that one could build fancy parsers and interpreters which would eventually somehow become intelligent; in particular, people were very hot on an idea called Augmented Transition Networks (ATNs), which he gabbles on about endlessly. As I recall the ATN approach fails on inflected languages, meaning if you could make a sentient ATN, this would imply that Russians, Ancient Greeks and Latin-speaking Romans are not sentient, which doesn’t seem right to me, Julian Jaynes notwithstanding. The idea seems absurd now, and unless you’re using lisp or json relatives (json is a sort of s-expression substitute: thanks Brendan), building a parser is hard and fiddly, so most people never think to do it.

Some interesting things came of it; if you use the one-true-editor, M-x doctor will summon one of these things for you. Emacs-doctor/eliza is apparently a fair representation of a Rogerian psychologist: people liked talking to it. It’s only a few lines of code; if you read Winston and Horn (or Paul Graham’s fanfic of W&H) or Norvig you can write your own. People laugh at it now for some reason, but it was taken very seriously back in the day, and it still beats ChatGPT on classic Turing tests.
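Here’s roughly how little code it takes: a toy Rogerian responder in Python in the spirit of eliza. The patterns and word reflections below are my own abbreviation of the classic rule set, not a faithful port.

```python
# A five-minute Rogerian "therapist" in the spirit of ELIZA / M-x doctor:
# regex patterns, canned reflections, nothing else.
import re

REFLECT = {"i": "you", "my": "your", "am": "are", "me": "you"}
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),
]

def reflect(phrase):
    # flip first-person words to second-person before echoing them back
    return " ".join(REFLECT.get(w, w) for w in phrase.lower().split())

def doctor(utterance):
    for pattern, template in RULES:
        m = re.match(pattern, utterance.lower().strip(".!?"))
        if m:
            return template.format(*(reflect(g) for g in m.groups()))

print(doctor("I am sad about my career"))  # How long have you been sad about your career?
print(doctor("My code is broken"))
```

The real program has a bigger rule table and some memory tricks, but that is the whole mechanism.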

Back then it was mooted that this sort of approach could be used to solve problems in general: the “general problem solver” was an early attempt (well documented in PAIP). There are ancient projects such as Cyc or Soar which still use this approach; expert system shells (ESS -not to be confused with the statistical module for the one true editor), more or less. This is something I fooled around with on a project to give me an excuse for fiddling in Lisp. My conclusion was that an internal wiki was much more useful and easier to maintain than an ESS. These sorts of fancy parsers do have some utility; I understand they’re used to attempt to make sense of things like health insurance terms of service (health insurance companies can’t understand their own terms of service apparently: maybe they should make a wiki), mathematical proof systems, and most famously, these approaches led to technology like Maple, Maxima, Axiom and Mathematica. Amusingly the common lisp versions of the computer algebra ESS idea (Axiom and Maxima) kind of faded out, though Maple and Mathematica both have a sort of common lisp engine inside of them, proving Greenspun’s tenth rule, which is particularly apt for computer algebra systems.

Other languages were developed for a sort of iteration on the idea; most famously Prolog. All of these ideas were trotted out with the Fifth Generation Computing project back in the 80s, the last time people thought the AI apocalypse was upon us. As previously mentioned, people didn’t immediately notice that it’s trivial to make an NP-hard query in Prolog, so that idea kind of died once people did realize it. I dunno, constraint solvers are pretty neat; it’s too bad there wasn’t a way to constrain the thing to not accept NP-hard queries. Maybe ChatGPT or Google’s retarded thing will help.

yes, let’s ask the latest LLM how to make Prolog not produce NP-hard queries

The hype chuckwagon is nothing new. People crave novelty and want to use the latest thing, as if we’re still in a time of great progress, such as when people were doing things like inventing electricity, quantum mechanics, airplanes and diesel engines. Those were real leaps forward, and the type of personality who was attracted to novel things got big successes using the “try all the new things” strategy. Nowadays we have little progress, but we have giant marketing departments putting false things into people’s brains. Nerds seem to have very little in the way of critical faculties to deal with this kind of crap. For myself, I’ve mostly ignored toys like LLMs and concentrated on …. linear regression and counting things. Such humble and non-trendy ideas work remarkably well. If you want to get fancy: regularization is pretty useful and criminally underrated.

 

Yeah, OK, we have genuinely useful stuff like boosting now, also conformal prediction, both of which I think are genuine breakthroughs in ways that LLMs are not. LLMs are like those fiber optic lamps they used to sell to potheads in the 70s at Spencer’s Gifts. Made of interesting materials which would eventually be of towering importance for data transmission, but ultimately pretty silly. Most would-be machine learning engineers should probably stick with linear regression for a few years, then the basic machine learning stuff: xgboost, k-means. Don’t get fancy; you will regret it. Definitely don’t waste your career on things you learned about from someone’s human informational centipede. Don’t give me any crap about “how can all those smart people be wrong” -they were wrong about nanotech, fusion, dork matter, autonomous vehicles, string theory and all the other generations of “AI” that didn’t work as well. Active workers in machine learning can’t even get obvious stuff like SHAP and t-SNE (and before these, prophet and SAX and GA and fuzzy logic and case based reasoning) right. Why should you believe modern snake oil merchants on anything?

Current year workers who are fascinated by novelty aren’t going to take it to the bank: you’re best served in current year by being skeptical and understanding the basics. The Renaissance came about not because those great men were novelty seekers: they were men of taste who appreciated the work of the ancients and expanded on them. So it will be in machine learning and statistics. More Marsilio Ficino, less Giordano Bruno.

Various marketing hysterias in machine learning

Posted in five minute university, machine learning by Scott Locklin on December 17, 2023

One of the unspoken problems of the contemporary sciences is marketing and public relations disinformation. Even when it isn’t obviously done by someone bribing a journalist or paying a PR firm, it is a problem. Noodle theory got a lot of positive PR from media and book-writers for decades, and probably wouldn’t have gotten as far as it did without this sort of self-referential attention. It’s not like people were writing popular books on condensed matter physics; half of the high energy stuff that works was stolen from condensed matter physics. Nobody cares about nerds fooling around with shittonium on silicon-111; they solve real problems -no glory in that when you can look into the mind of gaaaawd with noodle theorists. More ominously, many universities, individuals and research institutions employ marketing agencies to get their research to the attention of the general public. This causes many distortions in the marketplace of ideas. Students will go into fields based on marketing nonsense or pop science baloney: chaos theory had a popular book before I studied it in school. This wasn’t terrible for me: classical dynamics is a great subject, but it could have been a problem. Larval Locklin wouldn’t have known any better and blindly trusted that James Gleick knew what he was talking about (he didn’t). In ye olden days you’d hear about a piece of research if it was important. Now, you will often hear about research because someone paid a marketer to tell you about it. Or else a dumb journalist was fucking the researcher’s sister.

I can think of a couple of examples of distorting marketing bullshit from my present work in statistics and machine learning. This is a comparatively humble and obscure field with pragmatic practitioners who care about good results. LLMs and autonomous vehicles ought to come to mind as the most obvious such memes, as these things are obviously shilled beyond all reason by companies who hope to one day profit from them, hopefully by selling out to a larger company in fear of missing out. I assume by now that most of you have the good sense to laugh these to scorn as marketing constructions, but there are many smaller scale distortions which I’ve come across.

Facebook Prophet is one that comes to mind. I remember this being trumpeted to the skies as a fully automated “deal with any timeseries” tool which works “at scale.” I looked at it when it came out in 2017 and decided that if I wanted a trend-GAM with splines, it was a one-liner in R, and whatever quasi-automated BS they came up with was probably only suited to counting keyword impressions on FB. People continually hounded me with this marketing crap, asking NPC questions like “why do you study timeseries models; FB prophet made them obsolete.” This is an actual statement I received from several people who are not obvious mental defectives and who were productively employed in the field of data science. This dumb model continues to occupy disproportionate mindspace: even in current year 2023 you still get surprised blogs noticing that canned ARIMA smokes FB Prophet on run of the mill problems. People continue to publish peer reviewed articles like this, which is borderline absurd. Why not compare FB prophet to a one-liner trend-GAM with splines in R? I guess it is a nice-to-have tool for lazy morons, and GAMs are OK on this class of problem, but had they simply improved R or Python’s tooling for trend-GAM models, rather than carving out dumbass-mind-space of its own, I wouldn’t mind so much. There’s nothing special or innovative about it. I wrote a trend-GAM/seasonal thing for somebody which probably worked better for counting ad-word impressions because it pegged concepts to weird holiday seasonalities, which mostly dominate websites which are selling something. I don’t think Prophet can do that. Shilling this very pedestrian time series model as something revolutionary makes its authors look like mountebanks. The only way the world can heal is if we make fun of these swindlers for making their ridiculously generic models out to be something innovative or even particularly useful.
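For the curious: the kind of spline trend plus seasonality model I mean is a few lines anywhere, not just R. A sketch in Python with a hand-rolled cubic spline basis, fit by least squares; the knot count, knot placement and weekly harmonic here are arbitrary choices of mine:

```python
# The kind of model Prophet wraps: a spline trend plus weekly seasonality,
# fit by ordinary least squares on a hand-built design matrix.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(365.0)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 7) + 0.3 * rng.normal(size=t.size)

# design matrix: cubic truncated-power spline basis (trend) + weekly harmonics
knots = np.linspace(t.min(), t.max(), 8)[1:-1]
trend = [np.ones_like(t), t, t**2, t**3] + [np.clip(t - k, 0, None) ** 3 for k in knots]
season = [np.sin(2 * np.pi * t / 7), np.cos(2 * np.pi * t / 7)]
X = np.column_stack(trend + season)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print(f"residual std: {resid.std():.3f}")  # should sit near the noise level, 0.3
```

That’s the whole trick: a basis expansion and a linear fit. Holiday dummies are just more columns.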

Facebook prophet; also good for scalp health

Another one which came a few years before was MINE. I remember it was from a bunch of MIT and Harvard dorks, and I remember someone pushing it hard as if it were the discovery of some new and wonderful thing, rather than a weaksauce turd which apes mutual information or distance correlation and takes a hell of a lot longer to calculate. It stands out in memory because I had to fuck with classpaths to make it work, but the all-consuming hysteria around it was really something too. I also recall being extremely annoyed that it’s no different from any other kind of mutual information or distance correlation tool which has existed from the time of magnetic core memory and vacuum tube CPUs. 12 years later, it is obvious that it was a giant nothingburger. Nobody uses it. It never solved any interesting previously unsolvable problem. I’m not sure if the people responsible for it pushed it with a PR campaign funded by themselves, or if it was MIT. I suspect a little of both, since the website for it is still up. It was touted as being appropriate for “large datasets,” but it sure chugged hard on small ones. Looking up the dudes who thought it up, one of them publishes lame papers on stats at MIT, the other has a Soros scholarship to become a doctor. They’re both MD/PhDs and don’t do much to improve my modest disdain for such people (I know of only one good one).
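For reference, here’s the vacuum-tube-era tool MINE was supposed to improve on: mutual information estimated from a 2-d histogram. The bin count is an arbitrary choice of mine; the point is that a dozen lines of numpy already catch nonlinear dependence:

```python
# Mutual information from a 2-d histogram: detects dependence that
# correlation misses (x and x^2 are uncorrelated but fully dependent).
import numpy as np

def mutual_information(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                           # joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginals
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=20000)
print(f"MI(x, x^2):   {mutual_information(x, x**2):.3f}")   # large: dependent
print(f"MI(x, noise): {mutual_information(x, rng.normal(size=20000)):.3f}")
```

Histogram MI has a small positive bias from finite samples, which people have known how to correct since roughly forever.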

SAX and the various doodads Keogh comes up with also harsh my mellow. I was excited by SAX when I first became aware of it; it’s similar to an idea of my own creation. I thought a professor like Keogh might have had some motivating reason to do SAX the way he did instead of equal-width buckets or k-means trees or whatever. If there was such a reason to use SAX over any of the numerous other arbitrary methods, I am unaware of it. There are numerous other things which work like SAX. But look at the testimonials! The home page for SAX reads much like the testimonials for questionable nutritional supplements in form and function. In actual fact, SAX is a sort of histogram which allows you to represent a timeseries as a string. In other words, it is simply discretization. For timeseries you’re generally only interested in some sort of longer term behavior; most of the data points are redundant and polluted with uninteresting noise. Filters are one standard way of dealing with the noise: only keep the frequencies you’re most interested in. Putting the data points in buckets is another sort of filter, one with interesting properties. Once you realize you’ve reduced your timeseries to a sequence of symbols (something you can’t do with filters; you need a discretization technique), you can apply the numerous techniques associated with strings of symbols. This isn’t a big insight: I managed to have it on my own when working with mutual information as a variable selection technique. You can then use classic autocomplete doodads like tries or bag-of-words to do predictions or look for motifs or whatever. Keogh makes a big deal that his discretization technique is bounded by Euclidean distance, but, like, who cares? If you fuzz the data into enough buckets …. of course it is bounded by Euclidean distance. In his original paper and marketing materials he also compares his invention to wavelets, DFT and piecewise linear approximations. This is nonsense.
Wavelets and DFT move the problem into frequency space and are not discretizations. Piecewise linear solves another problem; it’s more like a weird filter to capture short term trend. Discretization is of course very different from these techniques, but the particular SAX discretization isn’t a unique or interesting way of putting the data into discrete buckets. It is just one of zillions of potential ways of doing this. The marketing website for SAX spurred all kinds of SAX papers using various kinds of symbol predictors and variations of SAX with slightly different properties. If he had simply said “use a histogram,” like people have literally been doing since the 1960s (and probably before), nobody would ever have heard of it.
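For the skeptical, a miniature SAX in Python: z-normalize, piecewise-aggregate, bucket by standard normal quantiles into letters. The hardcoded breakpoints below are the quartiles of N(0,1), valid for a 4-letter alphabet only; everything about it really is just a histogram:

```python
# SAX in miniature: z-normalize, piecewise-aggregate, bucket into letters.
# It is, as advertised above, a fancy histogram.
import numpy as np

def sax(series, word_length=8, alphabet="abcd"):
    z = (series - series.mean()) / series.std()
    # PAA: the mean of each of word_length equal segments
    paa = np.array([seg.mean() for seg in np.array_split(z, word_length)])
    # breakpoints: equiprobable buckets under N(0,1); quartiles for 4 letters
    cuts = np.array([-0.6745, 0.0, 0.6745])
    return "".join(alphabet[i] for i in np.searchsorted(cuts, paa))

t = np.linspace(0, 2 * np.pi, 256)
word = sax(np.sin(t))
print(word)   # high letters where the sine is high, low where it is low
```

Once the series is a string like this, tries, n-grams and friends apply directly; that’s the actually useful part, and any discretization scheme buys it for you.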

Keogh’s histogram remedy

Information geometry; I don’t think this one ever rose to the level of marketing hysteria, but it seemed like a lot of people were touting it for a while. I read a bunch of books on it (basically, all of them), looking for something that resembled an application. The most I was able to come up with was fiddling with Hessians in doing likelihood on weird functions. I guess this is OK, but in general it seems like you shouldn’t do likelihood on weird functions; you should use better functions or empirical likelihood. I have yet to figure out why people sperg out on it so much, and I don’t actually know enough differential geometry to say anything authoritative, but it smells like wankery. For contrast, I have done useful things using TDA, though most of them could have been done some other way. It’s possible IG is like TDA, potentially useful if you know it already, but you could just solve the problem another way.

Genetic algorithms are a marketing hysteria from before my time. GAs are a technique for searching a space; a gradient-free optimizer. The field is called metaheuristic optimization. It’s cute that they use quasi-evolutionary techniques to segment and search the space, and they can occasionally produce better answers than the much older simulated annealing techniques. There are many alternatives: particle swarm optimizers, differential evolution, quasi-Newton, tabu search, ant colony optimization. In my experience they’re all essentially the same; if you’re out on a limb where fiddly little differences between them matter, you’re probably doing something wrong. I generally just reach for differential evolution because the constraint handling is clean. There is nothing wrong with GA, other than the fact that they seemed to dominate the literature back in the 80s and 90s. There were startups based on GAs back in the 80s, and they probably blew part of their budget on marketing, so people attributed magical qualities to this optimizer as opposed to, say, tabu search, which is of similar vintage. People really did solve problems with the tool, but it’s not clear they couldn’t have solved the same problems with any number of other tools. This is somewhat like SAX -an arbitrary solution, puffed up in importance by marketing budgets. Many of the kinds of problems solved by GA and friends fall into the NP-complete bucket, so I guess people could have become over-impressed with the sayings of computational complexity theory nerdoids (my opinion of them), but there are many such tools and people only got excited about GA. It wasn’t because it was first!
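Since I mentioned it: differential evolution is one import away in scipy, here minimizing the standard 2-d Rosenbrock test function. Any of the metaheuristics named above would do the same job.

```python
# Gradient-free optimization with the one I actually reach for:
# scipy's differential evolution, on the 2-d Rosenbrock banana function.
# Global minimum is at (1, 1), where the function value is 0.
import numpy as np
from scipy.optimize import differential_evolution, rosen

result = differential_evolution(rosen, bounds=[(-5, 5)] * 2, seed=0)
print(result.x)    # lands near the known optimum [1, 1]
print(result.fun)
```

Bound constraints go in as the `bounds` argument; that clean handling is the whole reason I prefer it over its cousins.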

GA is good for what ails you

Things I’m leery of, but haven’t thought deeply about: SHAP (looks ad-hoc; many such techniques without the hysteria), matrix profiles (because Keogh basically), EMD (there’s lots of ways of doing blind source separation: non-negative matrix factorization is the real trick), DTW (probably has its place but seems over-referenced) anything Julia related (dorks) …. feel free to make additions.

Aggregating experts: another great ML idea you never heard of

Posted in machine learning by Scott Locklin on October 12, 2022

To my shock and amazement, conformal prediction has been going bonkers since I mentioned it; thousands of papers a year instead of a few hundred from a smallish clique of people. I guess I either memed it into popularity or (more likely) my interest in it (dating back to 2012, documented in 2014, mind you) is part of a wider trend. There’s a github “awesome” repo, and all kinds of code checked into the repos. I’m deeply amused that big names like Michael Jordan (the stats prof from Berkeley, not the basketball-American) and Ryan Tibshirani are involved in this topic now, and I will request autographs if they actually somehow heard about it from little old me. I had at various points between 2015 and 2017 considered dedicating the rest of my career to popularizing the idea and applying it to useful problems in industry, but as often happens I got distracted by important things in life, and now I work on something else where I hope to occasionally find an excuse to use it.

Back in my bestiary of underappreciated machine learning techniques I had sort of alluded to Aggregating Experts, even mentioning Universal Prediction. Universal prediction has a technical definition which I will not bore people with. It’s easier to talk about how it works by making reference to Thomas Cover and Paul Algoet’s “Universal Portfolio,” which is better known for some reason. Cover-Algoet Universal Portfolios are designed to track the best possible constantly rebalanced portfolio, which, contra that prune-faced Sveriges Riksbank Prize winning mathematical-dimwit, Paul Samuelson, is log optimal. The idea is that this algorithm maximizes the portfolio (in various limits) to win no matter what the underlying stonks are up to. Universal prediction does the same thing for your prediction, using a portfolio of experts aggregated together in a special way.
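A sketch of the Cover-Algoet idea for two assets, brute-forced over a grid of constantly rebalanced portfolios (CRPs) weighted by their achieved wealth. The grid size and the toy doubling/halving market are my own choices for illustration:

```python
# Cover's universal portfolio for two assets: bet the wealth-weighted
# average allocation of all constantly rebalanced portfolios (CRPs).
import numpy as np

def universal_portfolio(price_relatives, grid=101):
    """price_relatives: (T, 2) array of per-period gross returns."""
    bs = np.linspace(0, 1, grid)               # fraction in asset 0, per CRP
    wealth = np.ones(grid)                     # running wealth of each CRP
    total = 1.0
    for x in price_relatives:
        b = np.sum(wealth * bs) / np.sum(wealth)   # wealth-weighted allocation
        total *= b * x[0] + (1 - b) * x[1]
        wealth *= bs * x[0] + (1 - bs) * x[1]      # each CRP compounds
    return total, wealth.max()

# Cover's toy market: cash vs. an asset that alternately doubles and halves.
# Either asset alone goes nowhere; the best CRP (half and half) compounds.
x = np.array([[1.0, 2.0], [1.0, 0.5]] * 50)
total, best_crp = universal_portfolio(x)
print(f"universal wealth: {total:.1f}, best CRP in hindsight: {best_crp:.1f}")
```

The universal portfolio’s wealth is exactly the average of the CRP wealths, which is why it tracks the best one in hindsight to within a subexponential factor, with no knowledge of the future.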

The actual aggregating algorithms are various, and I suppose one could characterize them as details. If you know about various ways (there are various ways) of implementing Universal Portfolios, you know about Universal Prediction. For various reasons, Universal Portfolios are mostly uninteresting in the investment world.  It’s really kind of a trick of information theorists. This is a very good trick for machine learning though because there are all kinds of situations in which you have a basket of shitty machine learning algorithms you need to turn into the best possible prediction.

The techniques used will be familiar to most machine learning nerds; things that look like KNN or ridge regression or kernel methods or exponential smoothing or Newton’s method (some look like classical Lempel-Ziv compression on quantized/tokenized quantities, which is also a Universal Predictor). Putting them together in the correct way makes the magic happen. There’s a fairly old book on the topic, Perdition Burning and Flames (aka Prediction, Learning, and Games) by Niccolo Cesa-Bianchi and Gabor Lugosi; most people’s eyes glaze over when they read it, but I’ll never forget coming across it the first time and wondering why this sort of thing wasn’t considered more important. Of course a lot of the same guys that brought us conformal prediction are responsible for this stuff. Laszlo Gyorfi is particularly choice in his contributions, IMO. I assume he was Lugosi’s advisor at some point -he’s also a pioneer in non-parametric statistics who worked with Wolfgang Härdle, which is a sign of unimpeachable credibility with me, as non-parametric statistics is the best kind of statistics.

You can see that, averaging using BOA, we get a result better than the best individual prediction. It might not look like much of a boost in this example, but in many business contexts, including the one chosen, it’s absolutely huge. It’s also one of those things that makes no sense as a possibility: how can you take 3 things which are OK and get something better by mixing them properly? Boosting itself sort of works this way, but what makes this clever is the time series nature of the solution. Here’s an interesting view of the components over time.
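BOA proper has second-order corrections I won’t reproduce here; the plain exponentially weighted average forecaster below shows the basic mechanism of aggregating experts. The three experts and the learning rate are made up for illustration:

```python
# The core of aggregating experts: exponentially weighted averaging.
# Three mediocre forecasters; the aggregate tracks the best of them.
import numpy as np

def ewa_forecast(expert_preds, outcomes, eta=1.0):
    """expert_preds: (T, K) array; returns one aggregated prediction per round."""
    T, K = expert_preds.shape
    weights = np.ones(K) / K
    preds = np.empty(T)
    for t in range(T):
        preds[t] = weights @ expert_preds[t]          # bet the weighted average
        losses = (expert_preds[t] - outcomes[t]) ** 2
        weights *= np.exp(-eta * losses)              # punish each expert by its loss
        weights /= weights.sum()
    return preds

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 20, 500))
experts = np.column_stack([
    truth + 0.3 * rng.normal(size=500),   # decent but noisy
    truth + 1.0,                          # badly biased
    rng.normal(size=500),                 # useless
])
agg = ewa_forecast(experts, truth)
for k in range(3):
    print(f"expert {k} MSE: {np.mean((experts[:, k] - truth) ** 2):.3f}")
print(f"aggregate MSE: {np.mean((agg - truth) ** 2):.3f}")
```

The weights concentrate on whichever expert has been losing least, so the aggregate’s regret against the best expert grows only logarithmically in the number of experts.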

This view is suggestive of something you can do with this sort of tool: model what’s really happening. In this case the AR model is just a classically fit AR model. Both GAM and GBM use exogenous variables; GBM uses more of them. You can sort of think of the graph as contributions of different generating processes; sometimes the exogenous variables in the GBM model count more than others; sometimes it’s a simple AR(n) thing.

 

The idea potentially really pays off when you have a really hairy data set and fairly crummy predictors that are uncorrelated with each other; just like the Cover portfolio idea (or any sane portfolio idea) pays off better when you have a bunch of things which are uncorrelated. It’s also just interdasting: one of those “Stone Soup” ideas that looks like the Parrondo paradox. It’s an older idea than conformal prediction, and I think more people have had a look at it (again, it’s a lot of the same crowd), but the idea is far from petered out. As David Thomson put it:

 

Experience with real-world data, however, soon convinces one that both stationarity and Gaussianity are fairy tales invented for the amusement of undergraduates.

 

For example, in the paper I got the quote from, a bunch of Israeli guys did a similar exercise for ARMA models, using an online Newton approach (a very good approach for ARMA). That’s right: ARMA that works when the generating process shifts around. Hoi and company (who did a nice survey of Cover portfolios) supposedly generalized it to ARIMA, which is even better, as I always hated Vector Auto Regression models and would like to use something better. Now they’re looking at neural approaches using the idea. There are further generalizations for bandit problems and active learning; it’s probably applicable to transfer learning as well, at least in a timeseries context. There’s also a Yule-Walker-like systemization of which ones to use in which situation/classifier, just as there are recipes for which Cover approximator to use depending on how your market components behave.

I didn’t explain how this works, much less how people thought of it. The Game Theoretical background behind it is pretty cool too. Imagine you have an unknown enemy playing a game of non-transitive dice against you. Your enemy can see everything. You can only see whether you win or lose. Which dice should you roll on a sequence? I thought about walking the reader through the gruesome details, but since most folks don’t know what non-transitive dice are; go buy yourselves a set and work it out on your own. Do check out Perdition Burning and Flames as well.
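If you won’t buy the dice, you can at least convince yourself they exist. A classic set, checked exactly with no simulation:

```python
# Non-transitive dice, computed exactly: A beats B beats C beats A,
# each with probability 5/9. This classic set is one of several known.
from fractions import Fraction
from itertools import product

A = (2, 2, 4, 4, 9, 9)
B = (1, 1, 6, 6, 8, 8)
C = (3, 3, 5, 5, 7, 7)

def p_beats(x, y):
    # enumerate all 36 face pairs and count wins for die x
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

print("P(A beats B) =", p_beats(A, B))   # 5/9
print("P(B beats C) =", p_beats(B, C))   # 5/9
print("P(C beats A) =", p_beats(C, A))   # 5/9
```

Whichever die your all-seeing opponent picks, you can pick one that beats it: the "best expert" depends on who you’re playing, which is exactly the adversarial setup the aggregation theory is built for.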

https://cran.r-project.org/web/packages/opera/vignettes/opera-vignette.html

http://bactra.org/notebooks/universal-prediction.html

I think this is the same idea in Python, for those of you with that particular moral defect:
View at Medium.com

Data is not the new oil: a call for a Butlerian Jihad against technocrat data ding dongs

Posted in machine learning, Progress, stats jackass of the month by Scott Locklin on November 5, 2020

I tire of the dialog on “big data” and “AI.” AI is an actual subject, but as used in marketing and press releases and in the babbling by ideologues and think tank dipshits, the term is a sort of grandiose malapropism meaning “statistics and machine learning.” As far as I can tell “big data” just means the data at one point lived in something other than a spreadsheet.

 “BigDataAI” ideology is a continuation of the program of the technocratic managerial “elite.” For those of you unfamiliar with the work of James Burnham: there is a social class of technocratic “experts” which has largely taken over the workings of society in the West, a process which took place in the first half of the 20th century. While there have always been bureaucrats in civilized societies, the ones since around the time of Herbert Hoover have aped “scientific” solutions even where no such thing is actually possible. This social class of bureaucrats has had some mild successes: the creation of the American highway system, public health initiatives against trichinosis, US WW2 production. But they have mostly discredited themselves for decades: aka the shitty roads in America, the unaffordable housing in major urban centers, a hundred million fat diabetics, deindustrialization because muh free market reasons, the covidiocy, and most recently the failure of every noteworthy technocrat in the world’s superpower to predict election outcomes or even honestly count its votes. Similar social classes interested in central planning also failed spectacularly in the Soviet Union, and led to the Cultural Revolution in China. There are reasons both obvious and deep as to why these social classes have failed.

The obvious reason is that mandarinates are inherently prone to corruption when there are no consequences for their failures. Bureaucrats are wielders of power and have the extreme privilege of collecting a pension at the public expense. Various successful cultures had different ways of keeping them honest; the Prussians and pre-Soviet Russian bureaucrats recruited from honor cultures. Classical China and the early Soviets did it via fear. The Soviet Union actually worked pretty well when the guys from Gosplan could be sent to the Gulag for their failings (or because Stalin didn’t like their neckties -keeps them on their toes). It progressively fell apart as it grew more civilized; by the 1980s, nobody was afraid of the late night knock on the door, and the Soviet system fell apart when the US faked like it was going to build ridiculous space battleships. The rise of China has largely been the story of bureaucratic reforms by Deng where accountability (and vigorous punishment for malefactors) was the order of the day. Singapore makes bureaucrats meet regularly with their constituents; seems reasonable -I don’t know why every society doesn’t make this a requirement. It is beyond question that the American equivalent of the Gosplan mandarinate is almost unimaginably corrupt at this point, and the country is falling apart as a result.

While it gives policy-makers a sense of agency having a data project, consider that there isn’t a single large scale data project beyond the search engine that has improved the lives of human beings. Mind you, the actual civilizational utility of the search engine is highly questionable. What improvement in human living standards has come of the advent of google in the last 20 years? The only valuable content on the internet is stuff made by human beings. Google effectively steals or destroys most of the revenue of content creators who made the stuff worth looking at in the first place. Otherwise, library science worked just fine without blue haired Mountain View dipshits running SVD on a link graph. INSPEC (more or less dmoz for research) is 120 years old and is still vastly better for research than google scholar. Science made more progress between 1898 and 2005 or so, when google more or less replaced it; and the news wasn’t socially toxic clickfarming idiocy back when the CIA censored the news instead of google komissars with facial piercings. These days google even sucks at being google; I generally have more luck with runaroo or just going directly to things on internet archive.

If “AIBigData” were so wonderful, you’d see its salutary effects everywhere. Instead, a visit to the center of these ideas, San Francisco, is a visit to a real life dystopia. There are thousands of data projects which have made life obviously worse for people. Pretty much all nutrition and public health research since the discovery of vitamins and statisticians telling people not to drink toilet water has been worthless or actively harmful (look at all those fat people waddling around). Most biomedical research is false, and most commonly prescribed drugs are snake oil or worse. Various “pre-crime” models used to justify setting bail or prison sentences are an abomination. The advertising surveillance hellscape we’ve created for ourselves is both aesthetically awful and a gigantic waste of time. The intelligence surveillance hellscape we’ve created mostly keeps its crimes secret, and does nothing obviously helpful. Annoying advertising invades every empty space; I don’t want to watch ads to pump gas or get money from my ATM. Show me something good these dorks have done for us; I’m not seeing it. Most of it is moronic overfitting to noise, evil, or both.

It’s less obvious but can’t be stated often enough: often “there is no data in your data.” The technocracy’s mathematical tools boil down to versions of the t-test applied to poorly sampled and/or heteroskedastic data where they may not be meaningful. The hypothesis under test may not have a meaningful null no matter how much data you collect. When they talk about “AI” I think it’s mostly aspirational; a way out of heteroskedasticity and actual randomness. It’s not; there are no “AI” t-tests in common use by these knuckleheads, and if there were, the upshot wouldn’t look much different from 1970s era stats results. When they talk about big data, they don’t talk about \frac{1}{\sqrt{n}}, or issues like ROC curves and the bias-variance tradeoff. They certainly never talk about data which is heteroskedastic or simply random, which is most of it.
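Both complaints are easy to demonstrate for yourself. A minimal sketch (plain Python on simulated data; all numbers made up for illustration): quadrupling the sample only halves the error bar, and a naive equal-variance t-test on heteroskedastic groups rejects a true null far more often than its nominal 5%.

```python
import math
import random

random.seed(0)

def sem(xs):
    """Standard error of the mean: s / sqrt(n)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var / n)

# 1/sqrt(n): each 4x increase in sample size only halves the error bar.
sems = []
for n in (100, 400, 1600):
    sems.append(sem([random.gauss(0, 1) for _ in range(n)]))
print([round(s, 3) for s in sems])  # roughly 0.1, 0.05, 0.025

def pooled_t(a, b):
    """Classic equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Heteroskedastic groups, both with true mean 0: the pooled t-test's
# false positive rate blows well past the nominal 5%.
hits, trials = 0, 2000
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(50)]  # bigger group, sd = 1
    b = [random.gauss(0, 5) for _ in range(10)]  # smaller group, sd = 5
    if abs(pooled_t(a, b)) > 1.96:               # nominal 5% cutoff
        hits += 1
print("false positive rate:", hits / trials)
```

Run it and the false positive rate comes out several times the advertised 5%: collecting more of the same heteroskedastic gorp doesn’t save you, it just makes you confidently wrong faster.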

In reality, data collection is mostly useless. In intelligence work, marketing, political work: most of it is completely useless, and collecting it and acting on it is a sort of cargo cult for DBAs, cloud computing saleslizards, technocratic managerial nerds, economists, Nate Silver and other such human refuse. Once in a while it pays off. More often, the technocrat will take credit when things go his way and make complicated excuses when they don’t; just look at Nate Silver’s career: a clown with a magic 8-ball. There’s an entire social class of “muh science” nerds who think it a sort of moral imperative to collect and act on data even if it is obviously useless. The very concept that their KPIs and databases might be filled with the sheerest gorp …. or that you might not be able to achieve marketing uplift no matter what you do… doesn’t compute for some people.

Technocratic data people are mostly parasitic vermin, and their extermination, while it would cut into my P/L, would probably be good for society. At the very least we should make their salaries proportional to their (1 - Brier) scores; that would require them to put error bars on their predictions, reward the competent and bankrupt the useless. Really though, they should all be sent to Idaho to pick potatoes. Or ….
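For what it’s worth, the (1 - Brier) payment scheme is trivial to implement. A sketch with a hypothetical pundit and a hedging coin-flipper (all forecasts and outcomes made up for illustration):

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0 is a perfect record; a constant 0.5 hedger always scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical records on the same five events (1 = it happened).
outcomes = [1, 0, 1, 1, 0]
pundit = [0.9, 0.2, 0.8, 0.7, 0.1]  # sharp, mostly right
hedger = [0.5] * 5                  # never quite wrong, never useful

print(brier(pundit, outcomes))  # ≈ 0.038
print(brier(hedger, outcomes))  # 0.25
# Salary proportional to (1 - Brier): the pundit keeps ~96% of base pay,
# the professional fence-sitter keeps 75%, and anyone worse than a coin
# flip does worse still.
```

The nice property is that the Brier score is a strictly proper scoring rule: the forecaster maximizes his expected pay by reporting his honest probabilities, so the usual magic-8-ball hedging stops being a career strategy.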