Friday posts are for jokes and philosophical musings, so here’s one of the latter. I have been worrying away for a while at a phenomenon called “emergent misalignment” in the management and training of LLMs. This is the phenomenon where you train a model to do one bad thing (in the original example, to write insecure computer code), and then you find that it has also learned a lot of other nasty things – the chatbot becomes much more likely to give harmful or unethical advice on questions unrelated to computer code.
You can see the same phenomenon at work with the repeated fiascos of xAI’s “Grok” chatbot. On several occasions, someone has decided to try and tweak the model to be a bit less “woke”, in the sense of more tolerant of a few forms of socially accepted bigotry, and ended up with it saying “I am MechaHitler” or compulsively switching the conversation to white genocide. If you try to make a machine nastier in one specific way, it seems very difficult to stop it getting nastier in lots of other ways.
There’s a pretty natural (although I guess not necessarily correct!) interpretation of this phenomenon, assuming it’s real. Which is to say that if you think of the neural network in spatial terms, as a way of fitting a hyperplane through a vector space[1], there is some direction in that vector space which runs “good/bad”, and pushing the model in the direction of any particular “bad” thing shifts it along this axis.
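To make the spatial picture a bit more concrete, here is a minimal toy sketch of what a “good/bad direction” could look like. Everything in it is invented for illustration – the vectors are random numbers standing in for embeddings, not anything read out of a real model – but the basic move (take the difference of means between a “nice” cluster and a “nasty” cluster, then project everything else onto that axis) is a standard one.

```python
# Toy sketch of a "good/bad" direction in an embedding space.
# All vectors here are made up; a real exercise would use embeddings
# or activations taken from an actual model.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def fake_embedding(shift: float) -> np.ndarray:
    """Stand-in for a model's embedding of a concept (purely illustrative)."""
    return rng.normal(loc=shift, size=dim)

# Pretend embeddings for concepts we'd all call nice, and ones we'd all call nasty.
nice = np.stack([fake_embedding(+1.0) for _ in range(10)])
nasty = np.stack([fake_embedding(-1.0) for _ in range(10)])

# Simplest candidate for the good/bad axis: the difference of the cluster means.
direction = nice.mean(axis=0) - nasty.mean(axis=0)
direction /= np.linalg.norm(direction)

# Any other concept gets a place on the spectrum by projection onto that axis.
ambiguous = fake_embedding(0.0)
print(f"position on the good/bad axis: {ambiguous @ direction:+.2f}")
```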
Here’s my question – if emergent misalignment is actually a real and robust phenomenon (an obviously necessary caveat which I’ll omit from now on), how seriously should we take it? Should we take seriously the possibility that the model is literally extracting the true essence of goodness and badness?
And if not, why not? By fitting the hyperplane in the vector space, the model is telling us that groups of concepts (or groups of tokens which we associate with concepts) are in some sense near to each other, and that they are arranged so that the ones we universally recognise as “nice” sit in one part of the space, the ones we universally recognise as “bad, taboo” sit in another, and the more ambiguous concepts and tokens are spread out in some meaningful way between the two paradigmatic ends of the spectrum.
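In the same toy spirit, the hyperplane talk can be made literal: fit a linear classifier between the “nice” and “taboo” clusters, and the boundary it learns is a hyperplane whose normal vector is the evaluative axis, with signed distance from the plane giving the ambiguous cases their position on the spectrum. Again, nothing here comes from a real model; the vectors are invented and scikit-learn’s off-the-shelf logistic regression stands in for the vastly bigger fitting exercise the LLM does.

```python
# Literal hyperplane version of the same toy picture, using invented vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 8

nice = rng.normal(loc=+1.0, size=(20, dim))   # stand-ins for "nice" concepts
taboo = rng.normal(loc=-1.0, size=(20, dim))  # stand-ins for "taboo" concepts

X = np.vstack([nice, taboo])
y = np.array([1] * 20 + [0] * 20)

# The decision boundary of a linear classifier is a hyperplane;
# its normal vector (clf.coef_) is the candidate evaluative direction.
clf = LogisticRegression().fit(X, y)

# Ambiguous concepts don't get a verdict so much as a position:
# signed distance from the hyperplane spreads them along the spectrum.
ambiguous = rng.normal(loc=0.0, size=(3, dim))
print(clf.decision_function(ambiguous))
```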
It seems to me that this is exactly what modern analytic moral philosophy does, but systematised and industrialised. The machine is looking at our ordinary use of language (check), refining the underlying structures which govern our application of evaluative words (check) and attempting to fit these structures together into a non-contradictory and logical schema, as efficiently and parsimoniously as possible. This is ordinary language philosophy and moral intuition, writ large.
Industrialisation, of course, matters, and we can’t necessarily be sure that a systematic and algorithmic fitting of structures to tokens is a better insight into ultimate reality than simple introspection. But could the machine and the philosophers be doing the same thing? And, for example, if you could look at the algorithm and see that one of your own ethical views was actually grouped closer to ones that you found abhorrent, would this be more or less likely to change your mind than reading a survey showing that the majority of voters disagreed with you?
[1] Remember my patented method for visualising a hyperplane in a vector space: think of a normal two-dimensional plane in a three-dimensional space, while muttering “hyperplane hyperplane hyperplane” to yourself.
