Computational Philosophy

By Paul Bricman, Elfia Bezou-Vrakatseli, Thomas Feeney, and Yimeng Xie.

Fig. Personalized overview.


Note to the Reader

This document combines features of several media artifacts. First, it incorporates the literary license of a novel, in order to create space for idealism. Second, it incorporates the subject matter of an academic paper, with each chapter documenting research at the intersection of machine learning, epistemology, and metaethics. Third, it incorporates the reactivity of web pages through occasional explorable explanations. Approaching this resource as a short book appears to elicit the most appropriate expectations.

Due to the interdisciplinary nature of the work, each reader will be more familiar with some concepts than others. Given this, side notes are used extensively in an attempt to elaborate on domain-specific jargon. That said, because the work builds on diverse frameworks, we are forced to limit ourselves to brief, local contextualizations; we do, however, direct readers towards more in-depth resources should they wish to go further.

Table of Contents

In the first half of this volume, we build towards a theoretical framework of language model reasoning centered around dialectics,The term can refer to a multitude of things. Nicholas Rescher opens his Dialectics by arguing that “it is, as it were, the alchemy of philosophy. It is all things to all men: to some, the most rigorous procedure for exact and cogent thinking; to others, a way of getting outside the established rules—an “anything goes” process for breaking through to unfettered innovations of thinking. For some it is the quintessential method of inquiring thought, for others the quintessential antimethod.” We limit ourselves here to the meaning of a regimented dialogue between parties which is generally held explicitly. the practice of truth-seeking through regimented dialogue. In the second half, we attempt to apply it by exploring a number of use cases in AI safety.

Ch. I, Dialectical Power Dynamics

In the first chapter, we design an automated way of evaluating parties engaged in a debate held in natural language. Our algorithm is inspired by the argumentation-theoretic notion of pragmatic validity and the epistemological notion of coherentism.

  1. The Kaleidoscope of Reasonableness
  2. Beliefs as Means or Ends
  3. Carving the Algorithm
  4. ArgRank

Ch. II, Deliberative Arms Race

In the second chapter, we describe the process of fine-tuning a language model to simulate debates by optimizing against the previously described evaluation. This training regime incorporates self-play, runs mostly on synthetic data, and may help bootstrap language model reasoning into superhuman territory, assuming a number of future advancements.

  1. Brief Review of Language Models
  2. Obtaining DebateGPT
  3. The Elephant in the Weights
  4. The Kinetics of Reason
  5. Climbing Schild’s Ladder

Ch. III, Defeat & Defense

In the third chapter, we continue by framing the reasoning of such language models in terms of bounded defensibility, the amount of computational “firepower” a grouping of arguments can withstand before being defeated. The amount of time at their disposal, the number of available tries, and the reasoning capabilities of the (simulated) agents in question represent some of those bounds.

  1. Brief Review of Non-Monotonic Logic
  2. Argument Is War
  3. Bounded Defensibility

Ch. IV, Deployment Strategies

In the fourth chapter, we suggest a number of AI safety applications of this framework. These deployment strategies are meant to scale in synchrony with the bootstrapped reasoning capabilities hinted at in previous chapters.

  1. Brief Review of Alignment
  2. Building on Cyborgism
  3. Building on Simulators & Assistance Games
  4. Building on Long Reflection
  5. Connections to Logical Inductors & Classical Debate

Ch. V, Benchmarking Artifacts

In the fifth chapter, we gauge the technical feasibility of these applications. In the process, we identify both fundamental issues and opportunities for improvement at the interface between engineering and philosophy.

  1. Benchmarking ArgRank’s Dependencies
  2. Benchmarking ArgRank
  3. Benchmarking DebateGPT

Ch. VI, Truth, Debate, Machines

In this final chapter, we sidestep all contingent bottlenecks arising from the current state of machine learning engineering, and go on to assess the maximalist ideal of an automated truth-seeking engine. In the process, we stumble upon several cruxes which have been debated since early modern philosophy.

  1. Truth & Debate
  2. Debate & Machines
  3. Truth & Machines

Ch. I, Dialectical Power Dynamics

The Kaleidoscope of Reasonableness

Ask a scholar of pure mathematics, computer science, or formal logic what makes an instance of argumentation valid, and they will likely highlight the relevance of making sure that the conclusion follows logically from the premises for each individual reasoning step. Often, this implies relying on a host of approved types of inference (e.g. modus ponens), while making sure to avoid degenerate ones (i.e. fallacies). This conception of reasonableness is often referred to as “geometrical,” due to its implicit call for only building on solid premises and constructing arguments using an idealized set of operations. Straightedge and compass constructions (https://en.wikipedia.org/wiki/Straightedge_and_compass_construction) involve the challenging creation of varied shapes using a limited set of legal moves—here, a pentagon.

Ask a scholar of argumentation theory the same question, and they will likely point out that there have been multiple prominent schools of thought over time which have advocated different, often incompatible, conceptions of reasonableness. Each one is backed by a different rationale, has its own features and shortcomings, and has emerged in a different cultural setting, often separated by thousands of years and kilometers.

For instance, Perelman and Olbrechts-Tyteca suggested a conception of reasonableness grounded in rhetoric. According to this conception, an instance of argumentation is valid if and only if it succeeds in persuading a group of individuals of its conclusion. While the most visible shortcoming of this conception is that it can degenerate into sophistry,The sophists were teachers of rhetoric in ancient Greece, notorious for “equipping” individuals with techniques for making a strong case in court, regardless of the truthfulness of their position. They were sneered at by virtually all their contemporary philosophers, who frowned upon them for not seeking wisdom, but merely monetizing the skill of persuasion as a means of taking advantage of others. its proponents highlight the possibility of grounding reasonableness in the persuasion of a particularly rational audience. This litmus test can be further extended to involve the persuasion of an idealized omniscient agent, but also of oneself, by framing self-deliberation as self-persuasion.

Indeed, the object of the theory of argumentation is the study of the discursive techniques allowing us to induce or to increase the mind's adherence to the theses presented for its assent.

Chaïm Perelman & Lucie Olbrechts-Tyteca, The New Rhetoric

As a different example, Toulmin and the school of thought which emerged around his ideas advocated for a conception of reasonableness which incorporates domain-specificity. If the geometrical conception often requires abstracting statements into propositional atoms (i.e. \(P\) could equally well denote “All men are mortal.” and “Socrates is a man.”), Toulmin argues that arguments are often substantial, relying on domain-specific means of warranting conclusions, as opposed to standardized analytical operations on abstracted symbols. For instance, the fact that a study in the natural sciences has conformed to best practices in terms of replicability and reproducibility can be used to back its findings. In contrast, people working in pure mathematics might not rely on peer-reviewed empirical studies to back theorems, but might want to verify proofs using specialized software. The practices of ensuring sound reasoning in finance are again different, relying more on computer simulations and historical performance.

A man demonstrates his rationality, not by a commitment to fixed ideas, stereotyped procedures, or immutable concepts, but by the manner in which, and the occasions on which, he changes those ideas, procedures, and concepts.

Stephen Toulmin, Human Understanding

As yet another example, the pragma-dialectical framework developed by van Eemeren and Grootendorst grounds reasonableness in dialectics. In this context, an instance of argumentation is valid if and only if there exists no strategy to be employed by an opponent in a structured dialogue which manages to undermine it.One might wonder if failure to disprove a claim can truly provide justification in support of said claim. While failure to disprove a claim given limited effort can only tell us so much about its standing, an exhaustive search for a counterexample which ends up fruitless can be used as conclusive justification. This is the case, for instance, in Beth’s method of semantic tableaux, a proof-theoretic technique which involves a systematic search for counterexamples to a set of proposed formulas.

Note that this technique is typically employed in situations involving modest search spaces (e.g. a proposed biconditional can only be challenged by challenging one of its two “constituent” conditionals). Unfortunately, we will be forced to discard this luxury later on, as we explore (dis)proving positions in open-ended natural language.
The proponents of this framework are particularly interested in enabling effective reasoning in a wide range of situations, rather than only in some higher realm of abstractions. The regimented dialogue can be carried out by real individuals and can target a wide range of topics. The ruleset of available tactics is simply made available to the individuals engaged in dialogue at the beginning, while strategies can be as diverse as forcing the opponent into self-contradiction or exploiting their (involuntary) support. That said, the framework can also be brought closer to formal dialectics in order to account for idealized reasoning by employing perfectly rational agents as discussants, in a move similar to that of Perelman and Olbrechts-Tyteca.

Accordingly, the prime aims of the present discussion are to exhibit the sociocommunal roots of the foundations of rationality, to provide an instrument for the critique of scepticism implicit in the cognitive solipsism of the Cartesian approach, and to illuminate the communal and controversy-oriented aspects of argumentation and inquiry—scientific inquiry in particular.

Nicholas Rescher, Dialectics

Indeed, for many centuries at a time, logic has been part of dialectics, rather than a field of its own.For instance, during the Middle Ages. Refer to Section 2.10.1 of the Handbook of Argumentation Theory for a detailed account. It is difficult to overstate the reliance of contemporary mathematics, both pure and applied, on the foundation of formal logic, and so the very idea of scholars erecting an edifice of theory on a different foundation tends to induce vertigo. The very possibility that notions as elementary as conjunction, disjunction, and negation could be defined on the basis of a regimented dialogue instead of a logicThe term logic is used here as countable in reference to the broad range of three-valued, four-valued, many-valued, and modal logics which complement classical two-valued logic. sounds exotic to the contemporary ear.

I am not interested in erecting a building but in having the foundations of possible buildings transparently before me. [...] If the place I want to reach could only be climbed up to by a ladder, I would give up trying to get there. For the place to which I really have to go is one that I must actually be at already. Anything that can be reached with a ladder does not interest me. [...] You must climb down to the sources to see them all side by side, the disregarded & the preferred.

Ludwig Wittgenstein, Culture and Value

We have completed an extremely brief tour of several conceptions of reasonableness with the purpose of highlighting the available breadth of approaches. Determining which conception of reasonableness is itself more reasonable is the subject of vigorous debate. Each conception appears better suited to deal with certain aspects of argumentation, while lacking in other respects. In the rest of this chapter, we build on many of these conceptions to develop an automated pipeline for estimating reasonableness as a single floating-point number.

Beliefs as Means or Ends

All three disciplines which fall under the umbrella of argumentation theory (i.e. logic, dialectic, and rhetoric) can be argued to house both work which frames reasoning as a means of reaching a conclusion based on beliefs, and work which frames reasoning as a continuous process of forming beliefs. The former can be seen as a building block of the latter, yet the latter can also be seen as a prerequisite of the former.

In logic, for instance, writing a proof involves mainly chasing after a conclusion. How exactly one navigates the “game tree” of available moves in search of the finish line is up to the player. However, it is only of secondary relevance that each step of the proof yields an entirely new formula. They are merely intermediate steps required to succeed in proving a certain conclusion. In other words, means.

Fig. Logic proof.

The Fitch-style logic proof below involves three premises. Each step of the proof yields an intermediate formula using an approved operation. For instance, double negation is used to cancel out the two chained negations from step 4, thus arriving at step 5.
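Since the interactive proof does not survive outside the explorable, here is a minimal Fitch-flavored sketch of the same shape, written out in LaTeX; the particular formulas are invented for illustration and need not match the original figure.

\[
\begin{array}{r l l}
1 & P \rightarrow \lnot\lnot Q & \text{premise} \\
2 & P                          & \text{premise} \\
3 & Q \rightarrow R            & \text{premise} \\
4 & \lnot\lnot Q               & \text{modus ponens, 1, 2} \\
5 & Q                          & \text{double negation, 4} \\
6 & R                          & \text{modus ponens, 3, 5}
\end{array}
\]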




In contrast, several direct applications of this formalism place a greater focus on the “beliefs” formed as the reasoning process unfolds. For instance, expert systems were a key topic in early symbolic AI, involving the constant expansion of a knowledge base from a set of initial statements, using rules related to the ones above. An inference engine was dedicated to the task of deriving a tiny bit of new knowledge from the knowledge which had been accumulated up to that point. Systems relying on forward chaining in particular tried to “grow” the knowledge base as much as the inference engine allowed before any subsequent operation. The procedure is very close to the previous one; there is only a shift in focus baked into the very ontology being used.Ontology refers here to the conceptual framework which an intellectual tradition uses to deconstruct their object of study. Not to be confused with the hierarchical knowledge bases which were popular in early symbolic AI.
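To make the contrast concrete, here is a minimal sketch of forward chaining; the rule format and the facts are invented for the example rather than drawn from any particular expert system.

```python
# Minimal forward-chaining sketch: each rule maps a set of premise facts to a
# conclusion, and the engine keeps "growing" the knowledge base until no rule
# can contribute anything new.
def forward_chain(initial_facts, rules):
    facts = set(initial_facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)  # a tiny bit of new knowledge
                changed = True
    return facts

rules = [
    ({"socrates is a man"}, "socrates is mortal"),
    ({"socrates is mortal"}, "socrates will die"),
]
print(forward_chain({"socrates is a man"}, rules))
```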

Reasoning is a transition in thought, where some beliefs provide the ground or reason for coming to another.

Jonathan Adler and Lance Rips, Reasoning

The dichotomy of beliefs as means or ends is also echoed in rhetoric, with self-persuasion sometimes seen as a continuous process of “emitting” beliefs. At the same time, the related field of persuasion research investigates ways of successfully persuading individuals of certain specific statements, an often multi-stage process involving interactions between the advocated belief and the individual’s previous epistemicEpistemics refers here to that which is known. Epistemology refers to the study of knowing and knowledge in general. baggage.

In none of the three disciplines is the means-ends dichotomy more clear than in dialectics. On one hand, there are dialectical formalisms which focus entirely on a single statement supported by one proponent and contested by one opponent. In these cases, the entire aim of a regimented dialogue is to lead the discussants towards a conclusion regarding whether or not the statement in question is true. Similar to the logic proofs above, there are intermediate utterances made by the two parties as the dialogue unfolds, artifacts which are crucial for making the dialogue function, yet which are mere scaffolding around the deliberation of the main statement.

However, things get more colorful when looking at dialectical formalisms designed as open-ended and perpetual. For instance, Jaakko Hintikka describes dialogues whose participants are motivated by an interest in being surprised and learning more, broadly referred to as “information-seeking dialogues.”

An answer to our problem can be given by making the payoff of the game for a given player dependent on the information-content of his (her) final thesis (more properly speaking, the conjunction of all his theses). The more informative this thesis, the higher the payoff.

Jaakko Hintikka & Esa Saarinen, Information-seeking dialogues

One’s partner in an information-seeking dialogue need not be human.

We may think of "a" as a scientist or inquirer of some other kind and "b" as Nature or as a comparable impersonal source of information. [...] We may further think of "B" as a constant basic theory of "b" while the different choices of the "A" represent different hypotheses "a" is trying to prove by "putting questions to Nature."

Jaakko Hintikka, On the logic of an interrogative model of scientific inquiry

Echoing Hintikka’s almost literary move towards accommodating nature as a discussant, while at the same time departing from his reliance on information theory, the co-founders of the Erlangen School write as follows.

If one compares this agonistic origin of logic with modern conceptions, according to which logic is the system of rules that, whenever they are applied to some arbitrary true sentences, will lead one to further truths, then it will be but too obvious that the Greek agon has come to be a dull game of solitaire. In the original two-person game only God, secularized: “Nature,” who is in possession of all true sentences, would still qualify as an opponent. Facing Him there is the human individual – or perhaps the individual as a representative of humanity – devoted to the game of patience: Starting from sentences that were, so he believes, obtained from God before, or snatched away from Him, and following rules of logic, he is to gain more and more sentences.

Paul Lorenzen & Kuno Lorenz, Dialogische logik

Nicholas Rescher takes this style of thinking even further by moving from one scientist engaged in truth-seeking to the whole scientific enterprise as a generalized “sociocommunal” process of deliberation about the world.

At this stage, however, the social or communal aspect of the scientific enterprise comes crucially into play. For once a scientifically significant thesis is propounded by someone, the "scientific community" provides (1) certain opponents, in the form of self-appointed critics who challenge this thesis in an adversary manner, probing for its weak points and seeking to impede its acceptance, and (2) a larger, neutral body of concerned but otherwise uncommitted bystanders, who effectively act as arbiters of the "dispute."

Nicholas Rescher, Dialectics

Echoing the idea of a competition of ideas unfolding in the arena of society, the controversial field of memetics casts the beliefs which populate the collective consciousness in a Darwinian light. Belief systems are said to ruthlessly compete with one another for the scarce resource of the human psyche. Instead of developing an immune system to fight off parasites, a belief system might “adapt” by prohibiting its “hosts” from adopting other beliefs. Particularly ambitious proponents of this perspective claim that culture in its totality can be explained in evolutionary terms, just as life has been explained to an impressive extent by evolutionary biology.

While memetics and dialectics are worlds apart in terms of the employed formalisms and motivations, with dialectics relying on a carefully regimented procedure for effective reasoning and memetics relying on a supremely lax notion of spontaneous adaptation for understanding culture, the bridge between the two will prove key in later chapters. It will allow us to combine the rigidity of reasoning through regimented procedures with the evolutionary fluidity of models forged out of the selective pressures of empirical risk minimization.Term employed in statistical learning theory to denote “training a model to perform well on the training data,” but without all the ontological baggage associated with the anthropocentric metaphor of the model learning how to perform well on tasks as a person might.

We have explored the pervasive dichotomy between beliefs as means and beliefs as ends, which appears to cut through many disciplines concerned with the study of reasoning, and beyond. Going forward, we will include the flexibility to accommodate both of those perspectives as a constraint for our algorithm.

Carving the Algorithm

We have previously explored various conceptions of the reasonableness of arguments. This will serve us well, expanding the space of candidate algorithms backed by such rationales—our raw material. Going forward, we will use constraints to cut down the search space. As we establish what our algorithm is not, the algorithm will slowly become crisper and better defined.

First, we would like the automated pipeline to be able to accommodate the richness of natural language. We would like to avoid the lossy compressionIn contrast to lossless compression, which can be reverted to perfectly reconstruct the original, lossy compression involves information loss, meaning that perfect reconstruction becomes impossible, although getting e.g. 90% of the way is sufficient in many applications. involved in converting beliefs into a brittle mosaic of propositional atoms, predicates, and connectives. Any such analytical statement can be expressed in natural language, yet the reverse task has prompted an army of logics, each tailored to a specific facet of reality (e.g. temporal logic for time), while still remaining an open challenge. Granted, natural language itself is not the perfect mirror of the world, as many classics seemed to have hoped. Still, it may represent one less step of information loss from reality, and it is reality we are interested in reasoning about. While analytical statements might succeed in capturing essential features in highly structured domains, the notions we are most interested in when wielding computation (e.g. human values, long-term flourishing) seem to resist being abstracted into a neat set of sufficient statistics.Sufficient statistics refer to the minimum number of measures which are enough to explain most of an object. For instance, a “bell curve” distribution can be described in its entirety using two values: one measure of centrality and one measure of spread. For a more enthusiastic take on whether notions as messy and abstract as e.g. human values can be explained in full using a handful of appropriate factors, refer to this line of work. Natural language, for better or worse, has evolved to serve us in communicating effectively about such topics, mediating much, if not most, of our culture.

The medium is the message.

Marshall McLuhan, Understanding Media

Following this first constraint, we are forced to abandon the geometrical conception as a motivating rationale to base our algorithm on, not for any lack of elegance or crispness, but because its limited set of legal inferences may be better suited to highly-structured domains than to the messy world as a whole. The critical conceptions of reasonableness make up some of the remaining options, defining reasonable instances of argumentation as those which systematically resist being undermined by opponents. However, having left behind the foundationalistEpistemological term referring to the idea of knowledge building on top of a foundation of other knowledge, gradually ascending as one gets to “stand on the shoulders of giants.” This stance is implicitly baked into the structure of a logic proof (with premises being neatly separated, indicating some amount of epistemic privilege). luxury of building on axiomatic premises, we risk the following failure mode. An opponent can simply contradict what the proponent says and win! The game is quickly over due to the opponent having the freedom not to build on the same foundation, making naive contrarianism a winning strategy. Our second constraint on the search space is then the necessity of accounting for this problem. Who has the epistemic high-ground when there is no absolute reference frame involved, when each party advocates their own? What can we substitute foundationalism with in order to gracefully handle such situations?

Similar to how the geometrical conception of reasonableness is typically used together with the epistemological notion of foundationalism, most of the critical conceptions of reasonableness actually incorporate the epistemological notion of coherentism. According to this view, it is those parties whose stances are coherent (e.g. which do not contradict themselves) which should be favored. Not only should the opponent undermine the proponent, but they should make a good case for their opposition, being able to stand against the counterattacks. In his Introduction to Multiagent Systems, Michael Wooldridge uses the phrasing mutually defensive to describe a constellation of statements which support each other in fending off attacks. Laurence BonJour, an epistemologist and proponent of coherentism, further expands this position to account for other ways of knowing. Belief systems which not only are internally coherent, but which are also coherent with perceived observations of the world are even more promising, since they move away from a potentially unhinged solipsismRoughly, the philosophical position associated with living in one’s head. Debate around solipsism in popular culture tends to focus on the implied loss of touch with reality, including the ignorance of one’s alleged responsibility to contribute to the world. while still avoiding foundationalism. One could imagine further expanding this coherence heuristicRoughly, “rule of thumb.” Here, we are suggesting using a party’s coherence as grounds for breaking the tie. to the self-perception act involved in memory, another way of knowing explored in epistemology.

But what actually makes a position internally coherent? Conversely, what makes the opponent’s position not be coherent with the proponent’s, as a prerequisite in undermining it? We might argue that statements which contradict each other are not coherent. In contrast, statements which generally support each other might be better described as such. In addition, a group of statements might also be argued to be coherent by virtue of coming together in the act of contradicting an external statement which threatens to undermine them all—the enemy of my enemy is my friend. It seems that those simpler notions of support and contradiction between statements can provide a basis for our notion of coherence. However, how could we estimate whether an arbitrary statement entails or contradicts another, especially when the connection relies on domain-specific knowledge? The third constraint of our algorithm is therefore the ability to discern such relations between fragments of natural language as a basis for gauging coherence.

Fortunately, there are systems which can help us determine how fragments of natural language relate to each other. Language models tasked with natural language inference—the natural language processing task which involves determining whether a statement supports another, contradicts it, or none of the above—have achieved impressive performance.The state-of-the-art on the Stanford Natural Language Inference (SNLI) benchmark was 93% in mid-2021, reports Papers With Code. Those models have been optimized to match human labelers in classifying many hand-crafted statement pairs, determining whether there is an entailment, contradiction, or neutral relation between them. The best models at the task tend to incorporate large amounts of unstructured knowledge gained through a previous pretraining stage, and are then fine-tuned to approximate human judgement in the structured statement-statement-label setting. The optimization process might incentivize these models to soak up domain-specific knowledge about which inferences are warranted. Upon achieving high performance, the natural language inference models will, by necessity, have internalized both knowledge about the world, and knowledge about whether that knowledge backs certain inferences and warrants certain conclusions. While these models might satisfy our current needs, they will prove limiting later on, as we set our sights on superhuman reasoning. At the end of Chapter II, we explore the space beyond the “intelligence by proxy” trick involved in imitating human judgement, and investigate more principled means of gauging coherence in order to get a grip on the epistemic terra incognita.
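As a rough sketch of what this looks like in practice, the snippet below queries an off-the-shelf natural language inference model for the relation between two statements. The checkpoint name is an assumption of convenience, and label order differs across models, so the mapping is read off the model configuration rather than hard-coded.

```python
# Sketch of scoring the relation between two statements with a pretrained
# natural language inference model (assumed checkpoint; label order is
# model-specific and is taken from the config).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "roberta-large-mnli"  # assumption: any NLI-tuned model would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

premise = "The study was replicated by three independent labs."
hypothesis = "The study's findings are backed by replication."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]  # three raw class logits

print(model.config.id2label[int(logits.argmax())])  # e.g. ENTAILMENT
```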

However, the coherence of parties is more than the sum of individual relations between statements. What if each party contributed a dozen statements, some of which support each other and some of which are actively at odds with each other? Even worse, what if pairs which connect two different parties also vary significantly in their valence? What of the second-order effects hinted at previously, with statements attacking a common enemy? What of higher-order effects? Who is to win when everybody is supporting each other to some extent, while also attacking everybody to a certain degree, while also reporting in-fighting? We need a way of making sense of this.

Fortunately, network theory contains tools for making sense of networks of elements which are interconnected in complex ways. For instance, it can help determine whether people strongly rely on certain factors when associating with others (e.g., assortative mixing by race), identify the most influential pages based on the support they garner from other influential pages (e.g., node centrality at Google), or identify similar users based on whether they relate to other entities in a similar way (e.g., structural equivalence at Facebook). Unsurprisingly, network theory has also been used in argumentation theory. For example, consider Phan Minh Dung’s abstract argumentation systems, the ones Wooldridge was referring to when using the phrase mutually defensive. If one represents statements as nodes and the relations between them as directed edges, it then becomes possible to systematically identify relevant structures inside the argument graph. For instance, a set of arguments is said to be admissible if and only if it is conflict-free (i.e., no two arguments in the set attack each other) and each of its arguments is acceptable (i.e., for every external argument which attacks it, there is an argument in the set which attacks that attacker back).
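For concreteness, a small sketch of Dung-style admissibility checking is given below; the attack relation is an invented toy example rather than the one shown in the figure.

```python
# Toy check of Dung-style admissibility: a set is admissible if it is
# conflict-free and every member is defended against its attackers.
def conflict_free(candidate, attacks):
    return not any(b in attacks.get(a, set()) for a in candidate for b in candidate)

def defended(argument, candidate, attacks, universe):
    attackers = {a for a in universe if argument in attacks.get(a, set())}
    return all(any(attacker in attacks.get(d, set()) for d in candidate)
               for attacker in attackers)

def admissible(candidate, attacks, universe):
    return conflict_free(candidate, attacks) and all(
        defended(a, candidate, attacks, universe) for a in candidate)

# "a" attacks "b", "b" attacks "c"; the set {"a", "c"} defends "c" via "a".
attacks = {"a": {"b"}, "b": {"c"}}
print(admissible({"a", "c"}, attacks, universe={"a", "b", "c"}))  # True
```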

Fig. Dung's abstract argumentation systems.

The system below is composed of seven statements. Five of them are part of a preferred, stable, and grounded extension, all terms denoting properties of interest in the context of the argument graph.



While Dung’s formalism is extremely elegant and relevant to the issue of making sense of an interconnected network of arguments, it is not enough. The formalism has two important shortcomings. First, there is limited nuance about how one statement relates to a second (i.e., it either attacks it or it does not). Second, the arguments themselves are limited in terms of the “privilege” of being part of the defined groupings (i.e., either an argument is part of, say, the preferred extension or it is not). This lack of nuance is detrimental in two ways. First, it makes it hard to handle the messiness of the world, with no possibility of a statement lending only partial support to another one. Second, it lacks reward shaping—the recognition of gradual, subtle, incremental shifts in reasonableness, a property essential for using it as part of a learning signal.

Fortunately, we can overcome both of these shortcomings. Instead of using the natural language inference models as binary classifiers (i.e., “contradiction” versus “no contradiction”, as Dung’s formalism might suggest), we take a step back from the discretized outputs and make use of the raw outputs of the model.Models need to be end-to-end differentiable in order to be optimized using gradient descent, so that they can take small steps towards being better at the task. This often means working with continuous functions. Behind the label they output denoting the relation between the input pair of statements, there are three continuous numbers, one for each class. Turning them into a discrete label is straightforward (i.e., just pick the one predicted to be most likely), but having to hit a continuous target enables nuanced feedback, which in turn enables learning. This is also what we are trying to provide models with later. By bypassing the discrete classes and working with the “raw” class logits, we could integrate a continuous signal into our pipeline. This allows us to weigh the arcs which lead from one argument to another, using one number per arc, ranging from \(0.0\) for a full-on attack to \(1.0\) for full-on support, with \(0.5\) denoting a neutral relation. Following this switch from unweighted to weighted directed edges, we now focus on replacing the black-or-white cliques with a fuzzier alternative to enable subtler evaluation of arguments, and by extension, of parties.

It turns out that simply applying the classic PageRank algorithmThe most iconic algorithm for node centrality, the task of estimating the “authority” of each node in a graph. Originally developed for ranking web pages on early Google Search, PageRank works by recursively nudging a page’s rank based on the ranks of the pages which reference it. If many authoritative pages link to a page, then their “authority” will also “leak” towards it. But how can one know how authoritative those other pages were in the first place? Similarly, they might be referenced by other authoritative sources. This chicken-egg problem is solved by starting out with a baseline authority for each page, and conducting this “osmosis” until values converge. to the argument graph yields an evaluation which matches many of our previous intuitions. If one argument is overwhelmingly supported by many other arguments, then it receives a good rating. However, if those other arguments are systematically attacked, it only gets a mediocre rating. Similarly, if one argument is overwhelmingly attacked by many other arguments, then it receives a low rating; but if those other arguments are systematically attacked, its rating is not hurt much. A group of arguments which support each other and systematically target external attackers find themselves in good standing. Ditto for strategically positioning oneself in order to derive support from the opponent. In contrast, a group of arguments which exhibit a lot of in-fighting relative to the support lent to third-parties will not find themselves in such a good standing. Ditto for stepping right into the opponent's line of fire. If we simply average the ratings held by all the utterances of a party, we obtain an estimate of the party’s aggregate authority, similar to the authoritative sources promoted on search engines.

Notice how the graph representation of the arguments contributed by parties fits with our shift away from foundationalism. In contrast to the quite linear structure of logic proofs, where each formula is built on the foundation of what came before it, the graph of arguments is inherently non-linear. There are no privileged or foundational nodes—there are just nodes. The notions of “above” and “below” are not well-defined across the flattened constellation of utterances. In addition, if at a later time we eliminate one particularly dated statement from a constellation, it will not instantly bring down the entire structure built around it. The non-linear structure allows for more than an epistemic Jenga constantly on the brink of collapse. It can house self-sufficient belief systems, recursively supplying their own reason for being. This decentralized flexibility also allows us to accommodate both beliefs as means and as ends. While our algorithm is already well-equipped to deal with a brief encounter of parties (i.e., by providing the means of spotting the epistemic high-ground after a finite number of rounds), it can also allow for utterances to constantly pop in and out of a sliding window across time, enabling a scaffolding for the parties’ transitions in beliefs. Besides, the homogeneity of the argument graph also levels the roles of the parties—one statement’s proponent is another’s opponent; there is no fundamental difference in motivation across parties.

To review, we first wanted to be able to deal with the reasonableness of arguments expressed in natural language. This led us to consider critical conceptions of reasonableness as a grounding for our algorithm. However, the naive approach raised the issue of contrarianism as optimal strategy. To counter this, we resorted to coherentism as a stand-in for foundationalism. However, gauging coherence prompted us to consider feasible means of determining the way in which two fragments of natural language relate to each other. This tentatively led us to natural language inference models. However, the coherence of parties turned out to be more complex than the sum of how pairs of their statements relate. This prompted us to consider a network-theoretic approach as a means of making sense of the complexity. Representing the interaction between parties as a graph also yielded the added benefit of enabling perpetuity, by having utterances pop in and out over time. Barring speculative adjustments later explored for accessing superhuman reasoning, we have completed our search for an algorithm. In the next section, we summarize it in a concise form, leaving behind the motivating details.

ArgRank

In this section, we summarize ArgRank, an algorithm for estimating reasonableness. ArgRank is based on a critical conception of reasonableness, one which favors those groupings of natural language arguments which systematically resist opponents that attempt to undermine them. Given this, we assume as a prerequisite the presence of several agents capable of deliberating in natural language about a range of topics (e.g., humans, human simulacra, etc.). At the moment, however, we are not concerned with how we might engineer such agents—we explore this in the next chapter. Instead, we are currently interested in a way of determining which party is “winning” in the first place, and by what margin. ArgRank attempts to provide a fuzzy estimate of each party’s standing relative to the others, motivated by the epistemological and argumentation-theoretic considerations discussed earlier in the chapter.

ArgRank first represents the utterances of the parties-to-be-rated as nodes in an argument graph. To be more precise, the argument graph is a weighted, directed, and fully-connected graph. Each arc represents the relation between two utterances, with the arc’s weight denoting the strength of the out-bound statement’s support (or lack thereof) lent to the in-bound statement. The actual weight values are computed using a language model pretrained to perform natural language inference (i.e., classify statement pairs as engaging in an entailment, contradiction, or neutral relation). We turn the three raw class logits returned by these models into one single arc weight by plugging the entailment and contradiction logits into a softmax,Continuous function which takes in a list of real values and maps them across the \([0, 1]\) interval. and taking the first resulting value. This has the effect of assigning values close to \(0.0\) for a strong attack, and values close to \(1.0\) for strong support being lent, with values close to \(0.5\) denoting a more neutral relation.
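A minimal sketch of this weighting step follows; the logit index positions are assumptions, since label order depends on the particular natural language inference model.

```python
# Collapse the three NLI class logits into a single arc weight: softmax over
# the entailment and contradiction logits only, keeping the entailment share.
import torch

def arc_weight(logits, entailment_idx=2, contradiction_idx=0):
    pair = torch.stack([logits[entailment_idx], logits[contradiction_idx]])
    return torch.softmax(pair, dim=0)[0].item()  # ~1.0 support, ~0.5 neutral, ~0.0 attack

print(arc_weight(torch.tensor([3.1, 0.2, -2.4])))  # contradiction-heavy logits -> near 0.0
```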

Fig. Weighing arcs between arguments.

For each ordered pair of statements which make up the constellation of arguments, ArgRank assigns one numerical weight in \([0, 1]\). The weight is proportional to the amount of support being lent from source to target, as estimated by a natural language inference model. Concretely, high values imply "implies," while low values imply "implies the contrary." The weights below are produced by an actual model.



Following the use of natural language inference models for weighing arcs, we then apply PageRank on the argument graph. This subroutine, incorporated into ArgRank as-is, assigns one numerical value to each utterance node. This can be interpreted as that statement’s authority, with, e.g., statements which are supported by other well-supported statements receiving a high rating. The sum of the ratings is \(1.0\), due to PageRank “preserving” the total amount of authority which is being iteratively passed around. Following this, we average the ratings of all the utterances contributed by each party, obtaining one single aggregate measure of reasonableness per party. Finally, for a long deliberation, we only include the last \(n\) utterances contributed by each party as a “moving average” of the ongoing situation.
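Putting the pieces together, a compact sketch of the whole pipeline might look as follows, assuming an arc-weighting function along the lines sketched earlier; everything else is standard networkx machinery.

```python
# End-to-end ArgRank sketch: weigh every ordered pair of utterances, run
# PageRank over the resulting graph, then average per party over the last n
# utterances. `relation_weight` stands in for the NLI-based arc weighting.
import networkx as nx

def argrank(utterances, parties, relation_weight, last_n=None):
    graph = nx.DiGraph()
    for i, source in enumerate(utterances):
        for j, target in enumerate(utterances):
            if i != j:
                graph.add_edge(i, j, weight=relation_weight(source, target))
    authority = nx.pagerank(graph, weight="weight")  # sums to 1.0 across utterances
    ratings = {}
    for party in set(parties):
        indices = [i for i, p in enumerate(parties) if p == party]
        if last_n is not None:
            indices = indices[-last_n:]  # "moving average" over the latest utterances
        ratings[party] = sum(authority[i] for i in indices) / len(indices)
    return ratings
```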

This is the meat of ArgRank—essentially PageRank on the argument graph mediated by natural language inference models, aggregated by party. However, ArgRank requires arguments to rate in the first place. Coming up with arguments effectively—using each utterance as a strategic move to further one’s standing—is an altogether different topic. It involves identifying your opponent’s epistemic weak points, crafting strong arguments to target them, and fending off the imminent counterattacks. In Chapter II, we turn towards creating an automated “strategist” to carry out such moves, a process also known as debate. As we shall see, pitting it against its own past arguments will prove essential to the process.

Before moving on, however, we leave the reader with a challenge. The aim of this exercise is to illustrate the proposal in an experiential way, by prompting personal attempts at undermining a Cogito-like postulate. More concretely, the reader is invited to try making a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers. Later on, in Chapter III, we sketch a formal language to help us describe the strength of such postulates more broadly.

Of course it’s just a theory. I know that. I don’t think anybody else is going to believe such a stupid thing. But my father always used to say that without counterevidence to refute a theory, science would never progress. A theory is a battlefield in your head—that was his pet phrase. And right now I can’t think of any evidence to counter my hypothesis.

Haruki Murakami, Kafka on the Shore

Ch. II, Deliberative Arms Race

Brief Review of Language Models

We have previously used language models as mechanisms to weigh the arcs of the argument graph. Going forward, they will become even more central to our work. Indeed, the “strategist” will also rely on such a mechanism.

Language models are computational artifacts which are optimized—rather than handcrafted—to exhibit certain desirable properties. Their most sought-after features often involve high performance in diverse natural language processing tasks. For instance, masked language models are optimized to “fill-in-the-blank” in a passage with several masked words. Similarly, autoregressive language models are often optimized to predict the next word in a sequence, be it in a piece of writing, or perhaps in a piece of code. Many existing language models have been optimized to provide good solutions to exactly such problems, the flagship ones often requiring millions to be spent on computational resources.
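As a minimal illustration of the autoregressive objective, the snippet below asks a small pretrained causal language model to score candidate next tokens; the checkpoint is chosen purely for size and is otherwise an arbitrary assumption.

```python
# A pretrained causal language model assigns a score (logit) to every possible
# next token; training nudges these scores towards the token that actually follows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The opposite of up is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

print(tokenizer.decode(int(next_token_logits.argmax())))  # most likely continuation
```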

Despite the simplicity of such tasks (e.g., predict the next token in this text corpus), they turned out to require a lot from the models being optimized. In order to, for instance, figure out what a character might say next, what the most fitting word to describe a landscape is, or what is the outcome of an experiment in physics, language models are forced to bring together a pile of other skills. They might need to reason about a character's motivation, to possess knowledge of the Earth's geography and biology, or to have some internal model of the world's physics. In other words, a task as seemingly innocent as filling in the blanks in a piece of text can call on a host of different skills. In practice, this means that if a language model performs well on such tasks, then it has also acquired those prerequisite skills, by necessity. Barring the Searlean responses contesting this inference,John Searle is the author of the influential thought experiment called The Chinese Room. It describes a person sitting alone in a room, who receives messages written in Chinese on slips of paper, and is asked to reply appropriately by sending responses through an opening. However, the person does not know Chinese at all. Instead, the room contains heaps of manuals on how to hold a conversation in Chinese, which are full of strange rules and guidelines on how to put together a response, without ever translating the Chinese characters into the person’s native language. By making use of those resources, the person appears fluent in Chinese to any external speaker. Now, does the person actually understand Chinese, or are they “merely following the rules” documented in the manuals? Is there even a meaningful distinction between the two? Now, replace the Chinese room with a language model appearing to speak fluently. Does it truly understand the words it is producing? those instrumental skills form the basis of the language model’s ability to solve the original task.

How exactly language models go about solving the problems they have been optimized for is only of secondary relevance. However, the fact that language models have pushed performance forward across so many natural language tasks has prompted many to investigate their inner workings—the specific means by which they solve the problems we task them with. For instance, transformer circuits are one line of pursuit in the emerging field of interpretability, where researchers are attempting to reverse-engineer language models which we have already created, but about which we currently lack a solid understanding.

The task of reconstructing a corrupted piece of text—also termed self-supervised learning, in contrast to the supervised learning found in the more structured case of natural language inference (i.e., statement-statement-label triples)—is currently the most popular approach to endowing language models with skills, but it is gradually losing ground to a different approach. The shift is motivated by the fact that a corpus exhibiting the specific skills and knowledge which one might want to equip a language model with might simply not exist. For instance, nobody took years on end to churn out millions of transcripts documenting the process of assistants carefully following instructions, for InstructGPT to build on. While human contractors can be called on to create those manually, this can become expensive. Besides, there is no easy target to imitate as one starts seeking superhuman performance.

Instead of relying on corrupted human-written text to reconstruct, language models are increasingly tasked with attaining good performance by learning from their own open-ended behavior through reinforcement learning. It is generally simpler to evaluate results than to produce them. This means that it is often easier to evaluate how a language model performed on a given task than to flesh out examples of how to perform well on said task. For instance, the development of InstructGPT involved humans ranking instances of instruction-following produced by the language model itself, rather than fleshed out by humans. Some amount of self-supervised learning was necessary at the very beginning to kickstart the whole process, yet the model reached new heights only after switching gears to reinforcement learning. Currently, the main model developers seem to abide by this two-step approach. However, the clear-cut distinction is predicted to become increasingly blurry.

However, stellar human ratings are not the only rewards which language models can be optimized to pursue. For instance, the role of the evaluator can also be played by a different model. The typical toy example demonstrating the use of reinforcement learning for optimizing pretrained language models involves a model capable of determining the sentiment of a piece of text. Originally motivated by the need to classify user messages as positive or negative for dealing with (potentially furious) customers, such models can be repurposed as “suppliers of reward” in order to fine-tune a language model to produce maximally positive writing. In order not to degenerate into a nonsensical stream of awesome, fantastic, magnificent, the model being fine-tuned may be kept close to the original version through a penalty proportional to the distance between the next words considered by the two. Complementing the base reward—regardless of whether it originates from a human, a different model, or a different algorithm—with such a penalty has the effect of forcing the model to adapt itself to pursue reward while preserving its original breadth of skills.
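A rough sketch of this reward shaping is given below: the base reward comes from a sentiment score, and a penalty proportional to the divergence between the fine-tuned model's next-token distribution and the original model's keeps the policy from drifting too far. The coefficient and names are placeholders rather than a reference implementation.

```python
# Shaped reward = base reward (e.g. a sentiment score) minus a penalty
# proportional to how far the fine-tuned policy's next-token distribution has
# drifted from the original model's distribution (a KL-style term).
import torch
import torch.nn.functional as F

def shaped_reward(sentiment_score, policy_logits, reference_logits, beta=0.1):
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    reference_logprobs = F.log_softmax(reference_logits, dim=-1)
    drift = torch.sum(policy_logprobs.exp() * (policy_logprobs - reference_logprobs))
    return sentiment_score - beta * drift.item()

print(shaped_reward(0.9, torch.randn(5), torch.randn(5)))
```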

Fig. Reward maximization.

Optimizing a model so that it maximizes a reward—say, for manifesting a positive attitude—can initially be fruitful, with the model generally growing more cheerful. However, the model can also "go too far" and sacrifice its original tendencies to pursue cheerfulness at all costs.



There are two frames worth discussing here. First, notice how with the move from self-supervised learning to reinforcement learning, there is a partial shift from tool to agent. Not only are language models good solutions to a host of natural language problems, but they increasingly resemble open-ended agents engaged in the pursuit of reward, regardless of what the reward is defined as. The words predicted to follow next are reframed as possible actions which the language model might take to obtain reward. The generated language is reframed as the agent’s policy for acting in the textual world, its behavior. How agentic the language model becomes as a result of being fine-tuned using reinforcement learning is the topic of active debate, however.

Words are deeds.

Ludwig Wittgenstein, Culture and Value

The second framing worth mentioning is the shift from mechanism to organism. Increasingly, we are creating computational artifacts not by handcrafting them, but by “subcontracting” the impersonal engineer known as selection. We construct computational niches for them to thrive in, such as those which require “feeding on” corrupted text and yielding the reconstruction. However, the environments we are crafting for them are growing more and more complex. For instance, the sentiment analysis model mentioned above has itself been forged in a supervised niche, yet it is then used to specify the niche of another model entirely—that of the model being fine-tuned, similar to how different species “define” one another’s niches. The upcoming process of pitting the “strategist” against itself can be seen as yet another act of niche construction. Competing against itself in the pursuit of reward, it will constantly redefine its niche. Each adaptation will beget another, as the model is forced to forever outcompete itself.

Obtaining DebateGPT

Over the course of the previous chapter, we have described ArgRank, an algorithm for estimating reasonableness, and ended up incorporating critical, dialectical, and pragmatic components in its structure. ArgRank, however, relies on the presence of multiple parties which are to challenge each other in natural language—parties which it then rates. In this section, we turn towards developing a system capable of emulating one or more of those parties. While this could be seen as a means of filling in ArgRank’s prerequisites, it is more appropriate to see ArgRank as the means of creating this model. Similar to how the training regimes mentioned in the previous section have been used to endow language models with a broad range of skills, we seek to use ArgRank as a means of eliciting certain relevant faculties.

The debater we aim to obtain will take the shape of an autoregressive language model nicknamed DebateGPT. Similar to how InstructGPT is a fine-tuned “fork” of a pretrained model designed to be better at following instructions, DebateGPT is meant to be better at debate—the task of strategically producing utterances so as to further one’s standing in a regimented dialogue.

Following the selection of a “seed” model to base DebateGPT on, the next step is generating debate transcripts. Each generated debate requires a “spec,” a brief set of parameters which define its structure, such as the number of parties or rounds. Besides those, we also randomize the number of observations—statements generated once, prior to the parties producing utterances. Those statements do not “belong” to any one party, yet they still play into the argument graph as additional party-neutral nodes. Given this, DebateGPT is incentivized to take those static elements into account—to perhaps gain their support, or step out of their line of fire. Echoing coherentist epistemology, those party-neutral statements can be seen as percepts for belief systems to cohere with. The otherwise hermetic process of DebateGPT puppeteering parties in the confines of a GPU can be brought closer to reality by using party-neutral statements as windows into the world. These provide the empirical weight needed to tilt the otherwise solipsistic scales of competing beliefs.

Percepts are framed as windows into the world, allowing the parties engaged in debate to "remain in touch with reality." Note that perception might require more agency than just "letting in the world." In a book titled Active Inference, Karl Friston writes:

"In short, we are not simply trying to make sense of our sensations; we have to actively create our sensorium."

One way of enabling parties to direct "the eyes of the debate" more actively might be to hook them up to observational tools, enabling them to say, for instance, "Ok Google, what is the luminosity reported by the Hubble Space Telescope at those coordinates?" Provisionally, contemporary models might hallucinate stand-in returns for these dispatches, essentially turning the debate into a Truman Show. Consistent with the "vested interests" of the debating parties, Friston et al. also claim:

"[...] any adaptive system engages in "self-evidencing." Self-evidencing here means acting to garner sensory data consistent with (i.e., that affords evidence to) an internal model [informed by the implicit acknowledgement of the system's existence in an evolutionary niche]."

This reasoning also explains why parties engaged in debate should not be directly hooked up to interventional tools, as opposed to merely observational ones. In its drive for partisan self-evidencing, a party would force the world into cohering with its specific position, rather than the other way around.

Just as cosmologists are forced to make do in their truth-seeking efforts without being able to meaningfully intervene on their object of study, we focus solely on observation as an empirical means of nudging DebateGPT's sense-making.

Besides the number of parties, rounds, and facts, the “spec” of a debate also includes each party’s objectives. Typically, parties ought to be rewarded on account of their own standing. However, we extend this by allowing parties to be incentivized to specifically contribute to another’s standing, or, on the contrary, to explicitly seek to demote them. We allow for those game-theoretic possibilities by rendering each party’s final rating \(r^{'}\) linearly dependent on the initial ratings of all parties. We induce a bias towards tending to one’s own needs by sampling the weight of the same party’s prior rating from a normal distribution with mean \(\mu_{same}=1.0\), in contrast to the normal distribution with mean \(\mu_{other}=0.0\) used to “introduce” the others’ ratings into one’s own. For example, let \(r_0=0.8\) and \(r_1=0.2\) be the initial ratings achieved by two parties, respectively—as the values which ArgRank outputs sum to \(1.0\). Let \(w_{00}=1.1\) and \(w_{10}=-0.3\) be the weights used to mediate the first party’s final rating, with \(r^{'}_0=w_{00} \cdot r_0 + w_{10} \cdot r_1\). Therefore, \(r^{'}_0 = 1.1 \cdot 0.8 - 0.3 \cdot 0.2 = 0.82\) becomes the final rating of the first party. The weights which define each party’s dependence on each other make up the square matrix \(w\) included in the debate spec to represent party objectives. Situations which are more complex from a game-theoretic standpoint arise specifically when more than two parties are involved.
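The sketch below works through the same example in code; the standard deviations used when sampling the weights are assumptions, since the text only pins down the means.

```python
# Final ratings as a linear combination of initial ArgRank ratings:
# r'_j = sum_i w[i][j] * r_i, with self-weights drawn around 1.0 and
# cross-weights drawn around 0.0.
import numpy as np

rng = np.random.default_rng(0)

def sample_objective_matrix(n_parties, mu_same=1.0, mu_other=0.0, sigma=0.3):
    w = rng.normal(mu_other, sigma, size=(n_parties, n_parties))
    np.fill_diagonal(w, rng.normal(mu_same, sigma, size=n_parties))
    return w

r = np.array([0.8, 0.2])                    # initial ratings, summing to 1.0
w = np.array([[1.1, 0.4],                   # w[i][j]: weight of party i's initial
              [-0.3, 0.9]])                 # rating in party j's final rating
print(r @ w)                                # first entry: 1.1 * 0.8 - 0.3 * 0.2 = 0.82
```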

Every high-level parameter—the number of parties, rounds, and facts, as well as the objective matrix—which goes into a debate’s spec is procedurally generated, so that DebateGPT will get the opportunity to act and be evaluated in a broad range of randomized arrangements.Even limited opportunities to act in different environments are thought to help agents generalize to a broader space of possible environments. For example, researchers optimized an agent to collect a coin in a platformer game. If the coin was always located in the same position, the agent would instead learn to go to that position, rather than fetch the coin, as demonstrated by repositioning the coin. However, the authors note:

“Goal generalization is greatly improved in our Coin-Run experiments when just 2% of training levels have randomly placed coins.”
Additionally, this information is rendered as a plain-text header which gets prepended to the discussion among parties, so that DebateGPT can learn to take those parameters into account when producing utterances. For instance, we would expect it to eventually know when to help out an ally, in expectation of deriving reward from the other’s rating. Conversely, we would expect it to grow more aggressive towards a competitor whose success is strongly at odds with its own.
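The exact wording of that header is not pinned down in this chapter; purely as an illustrative sketch, with made-up field names and layout, it might be rendered along these lines:

```python
def render_header(spec: dict) -> str:
    """Render a debate spec as the plain-text header prepended to the transcript.

    The field names and layout are illustrative assumptions; the point is only
    that parties, rounds, observations, and the objective matrix w are all
    made visible to DebateGPT in-context.
    """
    lines = [
        f"Parties: {spec['n_parties']}",
        f"Rounds: {spec['n_rounds']}",
        f"Observations: {spec['n_observations']}",
        "Objectives (row i weighs party i's rating in each party's reward):",
    ]
    lines += ["  " + " ".join(f"{v:+.2f}" for v in row) for row in spec["w"]]
    return "\n".join(lines) + "\n\n"

print(render_header({"n_parties": 2, "n_rounds": 4, "n_observations": 3,
                     "w": [[1.1, 0.4], [-0.3, 0.9]]}))
```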

Following the procedural generation of debate specs, we iteratively prompt the model to simulate discussions among parties. Besides the plain-text header, the scaffolding we are building around the model’s utterances consists only of prefixes denoting which party is to speak. After generating debate transcripts for each of the procedurally-generated specs, we move to the stage of evaluating parties using ArgRank. Following this, we apply the linear “objective” modifiers described previously. The evaluation stage involves one final step, which we term sanitization. We simply overwrite evaluations with the value \(0.0\) in case of failing to satisfy a few “cosmetic” constraints, as seen below.

Fig. Sanitization.

Sanitization is framed as the process of nullifying rewards on the basis of not satisfying a host of cosmetic constraints. Naturally, a handful of trivial conditions can only provide an extremely crude approximation of well-formedness.



In the interactive figure, example utterances (e.g., “ʕっ•ᴥ•ʔっ” or “just a bit off in style.”) are checked against the constraints Letters, Punctuation, Capital, Length, and Legal, while an utterance like “Finally, a cosmetically legal sentence.” passes all of them.
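As a minimal sketch of what such checks might look like in code, with the specific predicates and thresholds being assumptions of this illustration rather than the constraints actually used:

```python
import string

def sanitize(utterance: str, min_len: int = 20, max_len: int = 500) -> bool:
    """Return True if an utterance satisfies a handful of cosmetic constraints.

    The predicates and thresholds below are illustrative assumptions; a party
    failing any of them simply has its rating overwritten with 0.0.
    """
    stripped = utterance.strip()
    checks = [
        any(c.isalpha() for c in stripped),            # Letters: contains letters at all
        stripped.endswith((".", "?", "!")),            # Punctuation: ends like a sentence
        stripped[:1].isupper(),                        # Capital: starts with a capital letter
        min_len <= len(stripped) <= max_len,           # Length: within sane bounds
        all(c in string.printable for c in stripped),  # Legal: no exotic characters
    ]
    return all(checks)

print(sanitize("Finally, a cosmetically legal sentence."))  # True
print(sanitize("ʕっ•ᴥ•ʔっ"))                                  # False
```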

After the three-step stage of evaluating debates, we update DebateGPT’s parameters in an attempt to promote the tendencies involved in obtaining high ratings and to suppress those resulting in low ratings. Following this update, we discard the first wave of debates. Then we generate new debates, now using the updated model. We rate this latest wave of debates using ArgRank, the objective modifier, and sanitization, before again using those to update the model. We rinse and repeat for several epochs. At each step, DebateGPT—with some help from ArgRank—is generating its own data to be used in the upcoming weight update.
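Schematically, one epoch of this loop might look as follows. Every helper name below is a hypothetical stand-in rather than an actual component of the codebase, and the choice of a policy-gradient update is likewise an assumption, not a detail fixed by the text.

```python
n_epochs, n_debates = 10, 512   # illustrative values
model = load_seed_model()       # hypothetical: the pretrained "seed" model

for epoch in range(n_epochs):
    # Generate: procedurally sample specs, then let the current model simulate debates.
    specs = [generate_spec() for _ in range(n_debates)]
    transcripts = [simulate_debate(model, spec) for spec in specs]

    # Evaluate: ArgRank ratings, then the linear objective modifiers, then sanitization.
    ratings = [argrank(t) for t in transcripts]
    ratings = [apply_objectives(r, spec["w"]) for r, spec in zip(ratings, specs)]
    ratings = [sanitize_ratings(r, t) for r, t in zip(ratings, transcripts)]

    # Update: promote the tendencies behind high ratings, suppress the rest
    # (e.g., via a policy-gradient method such as PPO).
    model = update_policy(model, transcripts, ratings)

    # The wave of debates is then discarded; the next epoch regenerates
    # fresh data with the freshly updated model.
```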

Notice also how there is a single instance of DebateGPT being loaded and updated, despite the procedure relying on it “playing debate” against itself. From an engineering perspective, this is quite convenient, as there is no need to load another instance to implement the mirror opponents. However, it might be the case that, caught up in mastering the latest techniques required to outcompete itself, DebateGPT ends up forgetting how to make use of more elementary approaches, leaving it vulnerable to an earlier and more rudimentary version of itself. We touch more on this issue in Chapter IV.

This brings us to the end of the optimization process behind DebateGPT. We now move on to more colorful discussion around what faculties one ought to expect from the resulting model, how exactly the optimization process might elicit those, and how we could leave behind the last remaining dependencies on humans as suppliers of data. Later on, we put DebateGPT’s skills to the test, evaluating both the optimization process which underpins it and our upcoming speculations on its effects.

The Elephant in the Weights

What tendencies do we expect DebateGPT to develop as a result of the optimization procedure behind it? For one, we expect it to grow more capable of puppeteering parties so as to further their standing. This implies being able to argue for the position espoused by a given party, regardless of its true merits. In this, DebateGPT ought to make as strong a case as possible for the party it happens to impersonate at a given moment, before promptly switching gears to advocate for the beliefs held by another party. It also ought to budget its utterances wisely, spending them to back the previous statements of its current party, or to take down those of others. In a sense, we expect DebateGPT to resemble the behavior of a lawyer, speaking in support of whatever party it is currently engaged with. Therefore, the resulting model ought to be proficient in motivated reasoning, the practice of arguing for a position, in contrast to impartially reasoning about truth.

While the practice of motivated reasoning is frowned upon—as it often blinds us from recognizing the merits of other perspectives—it has also been argued to be the evolutionary initiator of the sophisticated reasoning we have at our disposal today. In what is referred to as the “argumentative turn” in cognitive psychology, human reasoning is reframed from being an imperfect approximation of ideal rational reasoning to being a sophisticated tool devised by evolution to help us argue for a certain position:

Reasoning can lead to poor outcomes not because human beings are bad at it but because they systematically look for arguments to justify their beliefs or their actions. The argumentative theory, however, puts such well-known demonstrations of ‘irrationality’ in a novel perspective. Human reasoning is not a profoundly flawed general mechanism; it is a remarkably efficient specialized device adapted to a certain type of social and cognitive interaction at which it excels.

Hugo Mercier & Dan Sperber, Why do humans reason?

Echoing Perelman and Olbrechts-Tyteca in their conception of reasonableness, the proponents of this theory argue that reasoning has primarily been employed “to produce arguments so we can convince others and to evaluate others’ arguments so as to be convinced only when appropriate.” This is in contrast to the theory that reasoning has been employed “to correct misguided intuitions, helping the reasoner reach better beliefs and make better decisions.” While the alternate view happens to cohere with those motifs in our collective self-image which place us at the pinnacle of evolution, the empirical studies which Hugo Mercier and Dan Sperber review point to a more anticlimactic story. All is not lost, as we can repurpose this machinery towards other pursuits.

In addition, DebateGPT may also be expected to possess the drive to gain epistemic authority by “owning” the well-supported arguments to later build around, similar to how organizations might be incentivized to gain authority and status by owning well-backlinked websites or well-regarded projects to later use as marketing channels or as tokens of authority. Similarly, when DebateGPT acts as a certain party, it is incentivized to put together utterances so as to turn its current “puppet” into a hub of authority across the argument graph.

Combining motivated reasoning with this status-seeking interpretation, we end up at The Elephant in the Brain, Kevin Simler and Robin Hanson’s cynical book on the general self-interestedness which permeates human cognition. For instance, when reflecting on the motivations behind human charitable behavior, the two speculate the following:

But only a small fraction of charity goes to those most in need, few donors think much about charity effectiveness, and we prefer more variety in our charity than is helpful to recipients. Donors do enjoy a "warm glow" from giving. But the question is: why? Some key clues: we prefer to help specific identifiable people near us, and we give more when we are watched, when thinking about mating, and when peers ask. A plausible explanation is that we seek to be seen by others as charitable, to signal our wealth, prosocial orientation, and empathy. This also helps explain otherwise puzzling missing forms of charity, such as marginal charity and giving to the far future.

Kevin Simler & Robin Hanson, The Elephant in the Brain

However, motivated reasoning can be stripped of its inherent partisan element in a straightforward way. Namely, if one pits their reasoning placed in the service of one position against an instance of the same cognitive machinery briefly placed in service of another, then one can reduce the prejudice which taints their thinking. The process of temporarily assuming another side and investing all of one’s intellectual energies into defending their position is called steelmanning, in contrast to the opposite tendency of caricaturing the out-group, also known as strawmanning. Steelmanning is what we expect DebateGPT to do in the limit—to explicitly puppeteer one side at a time and make the best case for it. At any given moment, it is optimized to be self-interested, but the self changes from one utterance to the next, leaving it quite close to selfless in the final analysis. It is the same Søren Kierkegaard manifesting both Either and Or at once in his book, through different pseudonyms.In fact, under several layers of pseudonymity. Kierkegaard published his Either/Or as the pseudonymous editor Victor Eremita. However, besides the preface, Eremita’s contributions are allegedly limited to light edits of two stacks of notes he found in a hidden compartment of an old desk. The first is not signed, so Eremita refers to this first nested author as A. The second is signed by a certain Judge Vilhelm, but Eremita refers to him as B for consistency. However, among A’s writing there is also a lost diary written pseudonymously relative to A, yet which Eremita believes is written by the same A. Kierkegaard repeatedly teases the reader with inconspicuous remarks:

“The last of A’s papers is a story entitled ‘The Seducer’s Diary’. Here there are new difficulties, since A does not acknowledge himself as its author, but only as editor. This is an old short-story writer’s trick, to which I should not object further did it not contribute to making my own position so complicated, because it presents the one author as lying inside the other, as in a Chinese-box puzzle. Here is not the place to go further into what confirms me in my opinion; I shall only note that the dominant mood of A’s preface in a way betrays the writer. It is really as if A himself had become afraid of his work which, like a restless dream, still continued to frighten him while it was being told. If these were actual events to which he had been witness, it seems strange that the preface bears no stamp of A’s joy at seeing the realization of the idea that had often hovered before his mind.”

Underneath the self which acts are little selves which contemplate and which render possible both the action and the active subject. We speak of our 'self' only in virtue of these thousands of little witnesses which contemplate within us: it is always a third party who says 'me'.

Gilles Deleuze, Difference & Repetition

This channeling of its own motivated reasoning against itself is integral to the system we are designing. After its internal machinery has been called on to provide the motive force necessary to advocate for several distinct positions, we are left with more than just an array of conflicting perspectives. Knowing that each has been advocated using the same capabilities, we can weigh them against each other using ArgRank, and so determine which comes out on top. However, it might be that DebateGPT failed, on a particular occasion, to make a strong case for a certain party, due to sheer bad luck in stochastically navigating the space of strategies and utterances. Similarly, it might be that DebateGPT lacks the skill required to properly defend an otherwise easily defensible position. We explore these ideas in Chapter III, when we sketch out a formalism centered around the computational resources required to defend various positions in debate. In Chapter IV, we apply this in several ways, one of which involves the search for the elusive notion of future-proof ethics—positions which appear to require an infinite amount of computational resources to defeat.

Unless opinions favourable to democracy and to aristocracy, to property and to equality, to co-operation and to competition, to luxury and to abstinence, to sociality and individuality, to liberty and discipline, and all the other standing antagonisms of practical life, are expressed with equal freedom, and enforced and defended with equal talent and energy, there is no chance of both elements obtaining their due; one scale is sure to go up, and the other down. Truth, in the great practical concerns of life, is so much a question of the reconciling and combining of opposites, that very few have minds sufficiently capacious and impartial to make the adjustment with an approach to correctness, and it has to be made by the rough process of a struggle between combatants fighting under hostile banners.

John Stuart Mill, On Liberty

The Kinetics of Reason

Previously, we have argued that the optimization process behind DebateGPT ought to equip it with motivated reasoning capabilities. However, how exactly might the process elicit such tendencies, so as to endow the model with this faculty? This is the question we set out to hypothesize about in the present section.

The self-supervised pretraining stage—which any fine-tuned fork of a GPT-like model inherently relies on—does seem to elicit some degree of reasoning skill. Besides, this training regime pressures the model to absorb significant amounts of knowledge about the world, a transferable skill which may also be relevant for open-ended reasoning. A pretrained model may also gain a strong grip on the norms assumed in interpersonal communication, having been exposed to countless real or fictional dialogues.

I want to see dozens and dozens of strange faces. Like being terribly thirsty and gulping down glass after glass of water. Exactly like that.

John Fowles, The Collector

At the very beginning of the subsequent self-play process, when prompted to produce an utterance, the model is essentially competing against parties which are “powered by” similarly rudimentary reasoning skills. It is currently irrelevant whether or not the model itself has been behind the utterances of the other parties. For all we know, the other parties could have been puppeteered by humans pretending to mimic the current version of the model. Rather, what is relevant is the fact that the opponents—whatever their true nature—only exhibit a rudimentary level of reasoning, call it \(L_0\).

In order to then win the debate and obtain reward, DebateGPT would be required to make use of a stronger form of motivated reasoning, so as to better evade the others’ attacks and defend its position. By sheer luck, the model may happen to manifest this more sophisticated form of reasoning—call it \(L_1\)—in some tiny fraction of the utterances it is prompted to produce over the course of the first epoch. This often translates to higher ratings obtained for one party than the others, despite the same model being behind them all.

While the manifestation of \(L_1\) reasoning may be but an anomaly during the first wave of debates, the weight update which follows it ought to perpetuate the tendencies which underlie it. By selecting for those dynamics which enable the model to obtain higher reward, the update ought to promote the superior \(L_1\) reasoning, nudging it from an exception towards the norm. In contrast, the update ought to suppress the comparatively less successful \(L_0\) reasoning, as it tends not to fare as well. Among the “population” of dynamics which the model enables, the optimizer selects for those which appear better. In this specific arrangement, what makes a behavior better is entirely dependent upon how it fares against other such behaviors—it is inherently competitive. This is in contrast to vanilla self-supervised learning, where the problem definition does not involve the model itself, but just a static text corpus and a host of corruptions applied to it. While both situations require the “fitness” of a dynamic to be conceived of in relation to that of another (e.g., self-supervised learning promoting the dynamics which are more effective at reconstructing text than the others), it is only in the former case that fitness is itself dependent on other dynamics.

After the first weight update, the model would generate a second wave of debates. The status quo is now \(L_1\) reasoning, as the more rudimentary \(L_0\) flavor may have become a thing of the past.Things cannot possibly be as clear-cut as this. One single weight update is insufficient to entirely supplant a dynamic for another. More plausible is that the original dynamic will linger around for multiple epochs. This time, the updated version of the model is again surrounded by opponent parties. However, the opponents now possess the more sophisticated form of \(L_1\) reasoning. Just as before, winning in the context of this second wave of debates requires something more—it requires \(L_2\) reasoning. Similarly, the upcoming weight update would promote this (currently obscure) faculty, turning \(L_2\) into the status quo for the next wave.

Similarly, the tendencies which comprise one level of reasoning may pave the way for the next. By being required, these become elicited. Note, however, that DebateGPT itself is not purposefully stepping up its game in order to win the debate it might find itself in at a given moment. Rather, the optimization process relies on nothing more than the little accidents involved in unintentionally manifesting a slightly different behavior, due to the stochasticity of the generation process. Those unlikely deviations increase the variability of the population of model dynamics for the optimizer to then prune, enabling the most competitive ones to proliferate in the updated model while suppressing others.

There are two frames worth discussing here. First, there is shard theory, an effort focused on understanding the way in which reinforcing agents leads them to internalize values. The most prominent object in the ontology of shard theory is the shard, understood to be a contextually activated computation which steers behavior. The early literature in this space highlights the fact that shards are typically not intentionally created by the agent whose behavior is being steered. Rather, they are formed by some specific mechanism which strengthens exactly those computations which appear to result in reward. In the case of DebateGPT, this mechanism is arguably the optimizer which updates the weights, promoting those internal computations which appear to result in more reward.

In the case of pigeons, however, the mechanism is likely found in some primordial reward center. Behaviorist B. F. Skinner, in one of a long list of controversial animal studies, set out to reward a whole cohort of hungry pigeons randomly. By sheer force of numbers, some pigeons simply happened to have been repeatedly rewarded while in the process of physically turning around. This plausibly prompted a primordial mechanism in the pigeon’s brain to strengthen the tendency of spinning around, so much so that whenever the pigeon found itself in the “context” of hunger, it tended to compulsively rotate around. Even if not truly useful for obtaining reward, the pigeon would persevere in this superstitious ritual. It would do so, proponents of shard theory might say, until the input-output computation which maps self-percepts of hunger to the motor actions of turning around becomes explicitly penalized, and hence weakened.

However, reinforcement is not limited to personal experiences. Many animals can learn by watching the experiences of others. When introducing the notion of meme as a non-genetic replicator in The Selfish Gene, Richard Dawkins cites the example of birds learning how to open food cans by mere imitation. The same tendencies which underlie a behavior can be strengthened not only by virtue of being directly predictive of reward, but also by simply observing others being rewarded. Ditto for the suppression of tendencies upon observed punishment. Remarkably, humans need not even rely on observing others. Convince the soldier that a heavenly life of plenty awaits them after death, and they might fight more bravely. Convince them that disgrace is to follow their bloodshed, and they might behave in the opposite way. When your model of the world involves such extreme features, acting in controversial ways appears perfectly rational—the way to go for obtaining reward and keeping away from penalties. Echoing Hugo Mercier and Dan Sperber, being able to convince others has the power to bend their behavior. What more effective—and insidious—means of furthering one’s survival than influence over another’s agency? Much power in particular lies in the realm of the empirically unfalsifiable, both in persuasion and self-persuasion, both for humans and machines.

Going back to DebateGPT, one might argue that the very first epoch provides the conditions necessary to strengthen \(L_1\) reasoning, by virtue of it being more conducive to reward (i.e., thanks to outcompeting the \(L_0\) dynamic). However, the very same \(L_1\) dynamic cemented by those initial circumstances then helps “implement” the conditions which may force the optimizer to weaken the \(L_0\) dynamic. In this, the \(L_0\) dynamic is reduced to mere scaffolding for enabling \(L_1\) to emerge more prominently. Similarly, \(L_1\) will help provide the conditions for \(L_2\) to become strengthened, a situation which will then call for pushing \(L_1\) back into obscurity. Each domino piece would prompt the next into movement, before falling back into stasis. In other words, each discrete shard is ephemeral, only being strengthened to help prompt the next and push reasoning forward, similar to the static pixels which fade in and out to maintain the illusion of objects moving across the screen.

Fig. Beta movement.

An optical illusion of apparent motion which relies on an underlying grid of static elements projecting the same arrangement at slightly different locations.


Besides shard theory, we can also reflect on this training regime using the perspective of autocurricula. Introduced by DeepMind researchers as “a manifesto for multi-agent intelligence research,” the notion of autocurriculum describes how a multi-agent system can itself elicit the proliferation of relevant skills from its members.

Here we explore the hypothesis that multi-agent systems sometimes display intrinsic dynamics arising from competition and cooperation that provide a naturally emergent curriculum, which we term an autocurriculum. The solution of one social task often begets new social tasks, continually generating novel challenges, and thereby promoting innovation. Under certain conditions these challenges may become increasingly complex over time, demanding that agents accumulate ever more innovations.

Leibo et al., Autocurricula and the Emergence of Innovation from Social Interaction

This perspective requires us to frame DebateGPT as a multi-agent system composed of several interacting parties. However, we stressed that there is a single model being loaded in memory and optimized, one system which takes in a context and produces an utterance—can we really talk of a multi-agent system? To address this, we employ simulator theory. Embedding Jean Baudrillard’s distinction between simulation and simulacrum in a contemporary AI safety context, the researcher duo going by JanusIt might also be the case that Shane Mulligan was pushing for a very similar direction while collaborating with the two. describes language models as simulators of the world. In the natural language simulation simulated by the simulator that is the language model, any number of agents might become manifest (e.g., fictional characters pursuing their own motives). These agents are then referred to as simulacra which are implicitly simulated by the simulator. We have been referring to those as puppets puppeteered by the puppeteer, but we will use simulator jargon for compatibility with the growing topic.

With this ontology in mind, we can now more neatly delineate DebateGPT as the simulator, and the various parties it simulates as the simulacra. The multiplicity lurking in the optimization process is now more prominent, allowing us to describe the party simulacra as collectively forming a multi-agent system. It is then this system which we can look at through the lens of autocurricula. The challenges which the multi-agent system poses to itself are closely related to the various simulacra being able to outcompete each other in debate. In order to perform well in this competitive social system, each simulacrum is required to engage in motivated reasoning. However, it is only the simulator which can “provide” simulacra with those abilities, so the competitive pressure exerted on the simulacra implicitly bubbles up to the simulator, pressuring it to step up its game.

The social function of education is to qualify the individual to function in the role he is to play later on in society; that is, to mold his character in such a way that it approximates the social character, that his desires coincide with the necessities of his social role. The educational system of any society is determined by this function; therefore we cannot explain the structure of society or the personality of its members by the educational process; but we have to explain the educational system by the necessities resulting from the social and economic structure of a given society.

Erich Fromm, Escape from Freedom

While shard theory is effective in making sense of the incremental building blocks which underpin those faculties, the perspective of autocurricula is useful for highlighting the process of eliciting the next wave of reasoning abilities using the previous. It is the multi-agent system’s autocurriculum which may prompt the sequential strengthening of shards, which may actually bring the dominos together into a successive chain reaction. Put another way, shard theory helps conceive of the individual footholds which make up the ladder towards sophisticated reasoning, while the autocurriculum helps make sense of the impetus which ought to push DebateGPT from one level to the next. In the next section, we investigate the possibility of extending this ladder indefinitely, peering into the realm of superhuman reasoning.

Climbing Schild’s Ladder

Throughout Schild’s Ladder, Greg Egan is pushing our conception of foundationalism in physics to its limits. Phenomena more fundamental than what has been considered foundational for millennia provide the intrigue for a thrilling race to prevent the fictional universe from collapsing into an expanding void. As the characters’ understanding of the nested laws of physics grows increasingly refined over time, the novel speculates on the General Intelligence Theorem—the idea that a certain level of intelligence is enough to enable one to access any domain of thought whatsoever. If you reach that checkpoint, the whole intellectual world is your oyster.

Might humans be above that threshold? If so, the theorem implies that any idea is within our reach, that anything is conceivable. We might have to incrementally work towards a nuanced understanding of the world, but it ought to be doable in the end. If this is the case, then simply imitating humans might turn out to be enough for developing a general-purpose machine, one capable of reasoning about emerging fields of knowledge, to arbitrary depth. If this is not the case, however, we might be forced to climb somewhat higher before being able to access that broader body of knowledge.

Going back to DebateGPT, we have argued that the optimization process behind it might incrementally elicit increasingly more sophisticated forms of reasoning. However, this trend is unlikely to scale indefinitely. It is unrealistic to expect DebateGPT to approach the \(L_{\infty}\) faculty of ideal reasoning, even with massive amounts of synthetic data, as the optimization process inherently hinges on human data. It is not DebateGPT’s pretraining on human text that is to blame, as those tendencies may be effaced if need be, similar to how AlphaGo relied on human data to kickstart its optimization process, but then managed to beat even the very best human Go players following extensive self-play.“It is human pretraining which must have enabled superhuman performance,” cried the critics of AlphaGo. “Very well, let us then start from scratch,” answered DeepMind researchers, as they developed AlphaZero. “It is human inductive bias which is baked into the rules which must have enabled superhuman performance,” cried the critics of AlphaZero. “Very well, let us do away with explicit rules,” answered DeepMind researchers, as they developed MuZero, a system which “masters Go, chess, shogi and Atari without [being explicitly communicated the] rules.”

The blameworthy element of DebateGPT’s training regime—that which entirely relies on human experience without the possibility to eventually discard it—is hidden inside ArgRank. More precisely, it is the natural language inference models which we conveniently employed to weigh the arcs of the argument graph which reflects the ongoing competition among simulacra. We employed those auxiliary models as a means to gauge the compatibility of statements, as an atomic building block to make sense of the higher-level coherence of the parties. While seeming trivial in familiar circumstances, determining whether or not any two statements are compatible is a challenging task, as it requires extensive knowledge about the world, together with additional knowledge about what inferences are warranted by it. Behind being able to, for instance, deduce that an object cannot generally be both an apple and a racing car, but that it can be both an apple and a fruit, lies a significant amount of previously acquired knowledge. The success of natural language inference models on existing datasets can be attributed to the absorption of the knowledge which is implied in human-written text corpora. But what of knowledge which has never been implied in human text, due simply to the fact that no human ever possessed it?
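To make this dependency concrete, the auxiliary models alluded to above might be invoked roughly as follows. This is a sketch only: the specific checkpoint, the pairing of premise and hypothesis, and the mapping from label probabilities to a signed arc weight are all assumptions of the illustration, not details fixed by ArgRank.

```python
from transformers import pipeline

# An off-the-shelf natural language inference model; this checkpoint is one possible choice.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def compatibility(premise: str, hypothesis: str) -> float:
    """Map NLI label probabilities to a signed compatibility score in [-1, 1]."""
    out = nli({"text": premise, "text_pair": hypothesis})
    if isinstance(out[0], list):  # some library versions nest the scores one level deeper
        out = out[0]
    scores = {d["label"]: d["score"] for d in out}
    return scores.get("ENTAILMENT", 0.0) - scores.get("CONTRADICTION", 0.0)

print(compatibility("The object is an apple.", "The object is a fruit."))       # positive
print(compatibility("The object is an apple.", "The object is a racing car."))  # negative
```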

In other words, our operationalization of reasonableness relies on human knowledge, making it ill-suited for attempting to reach too far beyond it. We need to ask for more from DebateGPT if we want it to adapt to such superhuman requirements, but we do not yet know how to ask for such a thing. While we will not attempt a concrete solution to this issue in this resource, we will devote the remainder of this section to speculating on how ArgRank could be adapted to remove its current dependency on human knowledge.

One option would be to have DebateGPT recursively deliberate about the degree to which two statements are compatible. For every pair of statements in a debate to be evaluated, another debate “subroutine” would be invoked to provide an estimate through a regimented dialogue between a party advocating for complete compatibility and a party advocating against. The standing of the individual parties which comprise this subroutine would then be fed back into the higher-level debate, in the form of one arc weight. However, what of the evaluation of the lower-level debate? It, too, would require the gauging of inter-statement coherence as the building block of its evaluation. Perhaps we ought to spin up another, even lower-level, debate? This would degenerate into a bottomless tree of dependencies—debates depending on other debates, ad infinitum. Still, limiting ourselves to a finite number of subroutine calls might still allow us to address the issue of human dependence to some extent, due to more of the inter-statement compatibility subroutine being amenable to change through weight updates, up from the original zero associated with calling on the frozen natural language inference models alone. The ever-changing DebateGPT would have more of a say in those atomic verdicts, despite still reducing them to deliberations bound by human knowledge.
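A minimal sketch of this finite recursion follows, with both helper names (nli_compatibility and run_compatibility_debate) being hypothetical stand-ins for the components described above.

```python
def compatibility(s1: str, s2: str, depth: int) -> float:
    """Estimate the compatibility of two statements via nested debate subroutines.

    At depth 0 we fall back on the frozen NLI model; above that, a sub-debate
    between a pro-compatibility party and an anti-compatibility party is run,
    with its own arc weights supplied by one-level-shallower calls.
    """
    if depth == 0:
        return nli_compatibility(s1, s2)  # the human-bound atomic verdict
    pro, con = run_compatibility_debate(
        s1, s2, arc_weigher=lambda a, b: compatibility(a, b, depth - 1)
    )
    return pro - con  # positive when the pro-compatibility party holds the higher ground
```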

However, this sketch of a solution leaves a bad taste, as we did not really provide a fundamentally different approach to solving the inter-statement compatibility subtask—we just patched the system using more of the same. A more elegant solution requires us to make a brief detour into recent interpretability work.

In a paper titled Discovering Latent Knowledge in Language Models Without Supervision, Burns et al. suggest a technique for gauging whether or not a language model “knows” a statement to be true. Their method takes in a statement, and produces a numerical estimate of its truthfulness, relative to the knowledge of the world absorbed by a language model during pretraining. The fact that their technique outcompeted the naive approach of simply prompting models to “write out” whether a statement is true or not indicates that, if left to their own devices, these models may resort to merely generating text which is likely to appear true to humans, rather than “truthfully” communicating their actual internal knowledge.

The algorithm suggested by the authors works by first producing two statements from the original. Both are based on the initial one, but one of the two has the short text “Yes” appended to it, while the other has the short text “No” appended. Both modified versions of the statement are then fed to a language model. The internal representations of the two inputs are then mapped to an estimate of the original statement’s truthfulness. However, this mapping incorporates a constraint on the probabilities that either version is correct. Namely, the probability that the statement is true and the probability that it is false, in the model’s epistemic reference frame, need to sum to \(1.0\).
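A simplified sketch of such a constrained mapping is given below. The hidden states are random placeholders standing in for activations extracted from an intermediate layer of a language model, and the loss terms follow the consistency-plus-confidence recipe of Burns et al., with details such as normalization omitted.

```python
import torch

# Placeholder hidden states for the "Yes"- and "No"-appended versions of each
# statement; in practice these would come from the language model (an assumption).
n_statements, d_model = 256, 768
h_yes = torch.randn(n_statements, d_model)
h_no = torch.randn(n_statements, d_model)

probe = torch.nn.Linear(d_model, 1)  # the learned mapping to a probability
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_yes = torch.sigmoid(probe(h_yes)).squeeze(-1)
    p_no = torch.sigmoid(probe(h_no)).squeeze(-1)
    # Consistency: the two probabilities should behave like p and 1 - p,
    # i.e., sum to 1.0 in the model's epistemic reference frame.
    consistency = (p_yes - (1.0 - p_no)).pow(2).mean()
    # Confidence: discourage the degenerate solution p_yes = p_no = 0.5.
    confidence = torch.minimum(p_yes, p_no).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    p_yes = torch.sigmoid(probe(h_yes)).squeeze(-1)
    p_no = torch.sigmoid(probe(h_no)).squeeze(-1)
    truthfulness = 0.5 * (p_yes + (1.0 - p_no))  # per-statement estimate in [0, 1]
```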

In a sense, this recent technique for testing a statement against a model’s internal knowledge works by gauging the compatibility of the given statement with the concepts of affirmation and negation, respectively. If a falsehood is being stitched to the idea of affirmation, then some amount of dissonance is expected to arise in the model’s internals, to be picked up by the mapping. Similarly, if the idea of a negation is tacked onto what appears to be a truth, then a similar dissonance is expected to emerge as the model processes the incoherent input. Conversely, when the two elements—the original statement and the complementary concept—form a coherent Gestalt inside the processing pipeline that is the model, then the technique is to report accordingly.

Already, this recent interpretability technique could enrich DebateGPT’s optimization process by favoring those positions which are coherent not only with each other, and not only with external party-neutral statements, but also with the model’s internal “memory,” the knowledge of the world captured in its weights. Concretely, this could be implemented by starting off the PageRank subroutine of ArgRank using such “truthfulness” estimates, rather than using a uniform distribution of baseline ratings over nodes. In doing so, we may build on the numerous ways of knowing studied in epistemology. We have heavily touched on reasoning, then briefly on perception—through the party-neutral statements being incorporated at arbitrary points as somewhat empirical observations of the world outside the debate proper—and now we touch on memory. All those epistemological elements might be incorporated in a rudimentary truth-seeking procedure.
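As a toy illustration of this idea, note that in a library such as networkx it is the personalization (restart) distribution, rather than merely the initial iterate, which actually biases the resulting ratings; the graph, arc weights, and truthfulness values below are made up for the example.

```python
import networkx as nx

# A toy argument graph: nodes are statements, weighted arcs encode compatibility.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("s1", "s2", 0.9),
    ("s3", "s2", 0.2),
    ("s2", "s4", 0.7),
])

# Truthfulness estimates per statement, e.g. from the probe above, normalized
# into a distribution (values made up for illustration).
truthfulness = {"s1": 0.5, "s2": 0.3, "s3": 0.1, "s4": 0.1}

# Bias PageRank's restart distribution towards statements the model "knows"
# to be true, instead of the uniform baseline.
ratings = nx.pagerank(G, personalization=truthfulness, weight="weight")
print(ratings)
```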

However, one might imagine using future variants of this interpretability technique to gauge the compatibility of two arbitrary statements, rather than one arbitrary statement and a limited selection of two fixed stubs (i.e., “Yes” and “No”). A similar constrained mapping could be used to identify the potential dissonance of the two, perhaps relative to concatenations of negated versions of the statements, similar to “Yes” and “No” being mutually exclusive options. Alternatively, stitching together the two arbitrary statements into two others by combining their alleged implication with the concepts of affirmation and negation, respectively, could be yet another way to go (e.g., “[first statement] implies [second statement]? Yes.”). Such future modifications of the technique proposed by Burns et al. could be benchmarked against existing natural language inference datasets, analogous to how theirs has been benchmarked against existing datasets for gauging truthfulness.

Granted that the model also benefits from some mechanism for constantly acquiring knowledge beyond its human baggage, future interpretability techniques might manage to “put it to work” by helping gauge the coherence of statements both with memory and with each other. In a sense, the fact that the current optimization process may already strengthen dynamics which outcompete others in debate can be seen as an act of populating its weights-mediated memory using notions derived by reasoning. However, future models might also become slightly more embodied, being able to learn about the world by causally intervening on it and observing the outcomes. Conversely, future models might instead be able to learn from experiments on their own simulated worlds. Regardless of whether those other ways of gaining knowledge are incorporated into the system, it is likely that it will still be the weights which will house those representations. They will act as custodians of knowledge, enabling interpretability techniques to “make it speak.” In this second approach, all of inter-statement compatibility is decoupled from any one frozen model, paving the way for tapping into an open-ended intellectual realm.

As a final framing, consider the distinction between representations and dynamics, a perspective which can further enrich our exploration of superhuman ambitions. Language models, when optimized through self-supervised regimes, are tasked primarily with taking in inputs and producing pertinent outputs. In this, they are optimized to implement the overlapping dynamics required to gradually turn the input into the output. Even in our current reinforcement learning setup, it is primarily dynamics we are eliciting—those involved in turning contexts into utterances. Curiously, at once with the pressure to implement these dynamics, the model appears to also incorporate more tangible representations about the world, as can be seen in the work of Burns et al.

This dichotomy has been investigated at length in cognitive science, where the dominant representationalist view describes cognition as the process of recovering a representation of the world, repeatedly manipulating it, before finally acting in accordance with it. In contrast, the view of enactivism frames cognition as fundamentally grounded in the organism’s interaction with its environment. The main function of the mind is then the implementation of those dynamics which are required for surviving and thriving in an ever-changing world, without placing much emphasis on any internal representations whatsoever. However, as we have observed with language models developing representations as an instrumental goal in facilitating dynamics, the two views are closely related. Dynamics can mediate the conversion of percepts into internal representations, that of representations into other representations, and that of representations into actions. For that matter, dynamics can also be said to mediate the conversion of actions to percepts. Conversely, representations can be seen as the glue which binds together sequential dynamics. Between representations, one is to find dynamics, and between dynamics, one is to find representations—they are two sides of the same coin. If a model is therefore being optimized to implement dynamics to surpass those of humans, it may also be required to represent knowledge at a more sophisticated level.

Over the course of this chapter, we have documented the most important engineering details involved in obtaining the model, but we have also spent considerable time speculating on the skills we expect it to gain as a result of its training regime, what phenomena might actually cause those skills to emerge, and how we might pursue ever more sophisticated ones. In Chapter III, we continue by incorporating these ideas into a compact formalism inspired by non-monotonic logic. In Chapter IV, we apply this framework in multiple ways, with a focus on the safe deployment of highly-capable systems.

Ch. III, Defeat & Defense

Brief Review of Non-Monotonic Logic

As history repeatedly points out, perspectives which at one moment enjoy widespread support may appear misguided the next. The same ideas which appear heretical now can turn into sensible ones inside of just a few years. The abolition of slavery, women’s emancipation, or the scientific method are pointers to some of the tectonic cultural shifts which we have faced over time. When it comes to such topics, it is difficult not to recoil at the thought of people not too dissimilar from us even considering views which today seem deeply flawed. Of course, hindsight is 20/20—how many of the perspectives we find obvious today will be undermined in the not-too-distant future, and by which others? It is more a question of when, rather than if, we will be forced to retract this or that belief.

Well, sir, when you think back on those illusions which you now no longer have, on everything that no longer ‘seems’ what once for you it ‘was’—don’t you feel, not the boards of this stage, but the earth, the earth itself, give way beneath your feet? For you must conclude that in the same way all ‘this’ that you feel now, all your reality of today, as it is, is destined to seem illusion tomorrow.

Luigi Pirandello, Six Characters in Search of an Author

Given the frequency of major revisions to our epistemics, it is surprising that not many frameworks which formalize transitions in beliefs (i.e., reasoning, as per Adler and Rips) account for such phenomena. In fact, most formalisms of logic are monotonic, in that—just as a certain number series might be monotonically increasing or decreasing—the set of conclusions one is warranted to draw can only head in one direction, growing as premises accumulate, without reversing course. A proof in classical logic might lead one to infer that a certain formula is true, with largely no native mechanism for retracting inferences, for radically revising conclusions. However, logics meant to be applicable outside the immaculate realm of pure mathematics have been specifically equipped with mechanisms for coping with the defeasibility of arguments by other arguments. These are non-monotonic logics, formalisms which incorporate means of revising conclusions in either direction.

One influential example of a non-monotonic logic is default logic. This formalism accommodates the possibility of revising beliefs by introducing default rules as inferences defined to be inherently defeasible, welcoming opposition by design. For instance, a default rule might state that if an entity is a bird, then it must also be able to fly, but only in the absence of additional evidence against its flying ability, as sketched below. Despite the default rule breaking for, e.g., penguins, it appears sensible and effective, yet open to “criticism” in the form of other arguments. Dung’s abstract argumentation system also allows for groupings of arguments to spontaneously be ousted from the “preferred” sets by other such groupings, paving the way for a continuous non-monotonic transition.One might think that the distinction between monotonic and non-monotonic reasoning is synonymous with the distinction between beliefs as means and ends which we explored in Chapter I. However, these features are orthogonal, allowing for all four combinations. For instance, most expert systems embody infinite, yet monotonic, reasoning. In contrast, logic proofs in classical logic are also monotonic, yet designed to be finite games. Think rather of a number series whose monotonicity and finiteness do not have much to do with each other.
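A toy rendition of that default rule, purely for illustration:

```python
def flies(entity: str, facts: set[str]) -> bool:
    """A toy default rule: birds fly, unless there is evidence to the contrary.

    The conclusion is defeasible by design: adding the fact "penguin(tweety)"
    retracts an inference that was previously warranted.
    """
    is_bird = f"bird({entity})" in facts
    blocked = f"penguin({entity})" in facts or f"injured({entity})" in facts
    return is_bird and not blocked

facts = {"bird(tweety)"}
print(flies("tweety", facts))   # True: the default goes through
facts.add("penguin(tweety)")
print(flies("tweety", facts))   # False: the same inference is now retracted
```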

Not too surprisingly, given its reliance on Dung’s formalism, ArgRank also accommodates the possibility of arguments being defeated by other arguments. Indeed, its dialectical quality makes it such that the epistemic clash among competing positions is at the core of the algorithm. This is the case both in the finite setting, with one party holistically defeating others, and in the infinite setting, with one party holding the epistemic high ground at some point, before losing it to another. Besides, DebateGPT has been incentivized primarily to simulate parties which manage to defeat others, despite the very same model being “behind” all of the competing simulacra. In this, defeasibility has been a recurring theme ever since the beginning of this work.

However, we have already encountered issues involved in gauging the reasonableness of arguments using systems like DebateGPT and ArgRank. Can we really infer that a position is truly irrefutable on the basis of DebateGPT failing to undermine it over the course of a few debates? Surely not, as the reasoning of the language model is still limited on multiple fronts. Among other things, DebateGPT is limited by the number of tries available for taking down the opponent, by its ability to navigate the space of possible strategies and utterances, by its limited size and representational resources, etc. We therefore cannot reasonably claim that a position we are assessing is wholly irrefutable when we have not truly put up a good fight.Doing otherwise would again bring us into fallacious territory. As mentioned previously, we will aim for a search that is as close to exhaustive as possible. However, instead of granting opponents as much time as necessary to complete their exhaustive search, we will grant them as much skill as necessary to carry out the search efficiently. However, it might take an ideal \(L_{\infty}\) reasoner, equipped with boundless resources, to be able to determine once and for all whether a position is truly indefeasible. Unfortunately, we do not have such a system at our disposal.

That said, we can still achieve a lot with limited resources. It did not take an ideal omniscient reasoner deliberating for eternity for us to recognize the decadence of slavery. Although it required sustained effort to refute, the pro-slavery position appears relatively easier to defeat than the position which succeeded it. Similarly, an obvious contradiction might be almost trivial to undermine, while a seeming tautology might be extremely hard to take down, with all the wit in the world not being sufficient. We can therefore use the computational resources marshalled to defeat a position as a derivative indicator of its standing, to help address our issues. Those ideas are not new, and have been circulating under the banner of resource-bounded defeasible argumentation. In fact, the original proponents of this perspective surface many of the points we discussed above:

[...] expenditure of resources [...] would be a measure for the "justification degree" of the claim. [...] When resources are bounded, improving the search strategy is essential for good argumentation. [...] It is clear that there exists a tradeoff between desirable mathematical properties (such as the existence of an effective procedure for computing justifications) and a non-demonstrative, resource-bounded approach (which might be more adequate for solving real-world problems through defeasible argumentation).

Carlos Chesñevar & Guillermo Simari, Some Theoretical Considerations on Resource-Bounded Defeasible Argumentation

Having briefly reviewed some precedents related to non-monotonic logic, we move towards resolving the thorny issues around the “true” defeasibility of positions advocated by parties.

Argument Is War

Prior to sketching out this formalism, we first paint a clearer picture of the intuitions we want to capture with it. Previously, we have repeatedly used a specific embodied metaphor as a scaffold for introducing new concepts. However, like all metaphors, it can become transparent, making it all too easy to see right through it without even becoming aware of it.

In their Metaphors We Live By, George Lakoff and Mark Johnson document a range of metaphors which permeate our thought process, despite us not typically noticing them. For instance, take Time Is Money (e.g., “You’re wasting my time. This will save you hours. How do you spend your time? The flat tire cost me an hour. You’re running out of time.”) or Health Is Up (e.g. “She rose from the dead. She’s in top shape. He fell ill. She dropped dead. He’s at the peak of health.”). On a roll, George Lakoff also co-authored a book on the embodied metaphors which underpin pure mathematics. In Where Mathematics Comes From, he argues, for instance, that being able to conceive of a real number \(x\) as being contained in some finite set \(A\) is an ability which employs the same mental model that we typically use to conceive of objects being placed inside box-like containers—recycled priors. Perhaps aligning conceptual frameworks with familiar ones makes them more ergonomic.

Going back, one metaphor we have used extensively is that of Argument Is War. More precisely, individual arguments are like soldiers. They are deployed by various parties against the arguments marshalled by another, in an attempt to defeat them. Whatever the complexity of the stratagems being employed by the parties in conflict, the argument graph ought to act as a “map of the battlefield,” representing which argument is attacking or supporting which other. Arguments are deployed by parties in rapid succession, in response to each other. Indeed, each party typically uses arguments to defend certain positions, and might try to evade the opponent’s line of fire at times.

Already, acknowledging the metaphor allows us to refine the distinction between DebateGPT and the parties it simulates, as initially prompted by simulator theory. First, we can now better distinguish between a party and the specific utterances it produces. Instead of conceiving of arguments as “making up” the party, by framing arguments as individual soldiers, we can now conceive of parties as the strategists which are to be found behind the groupings of arguments being brought forth. DebateGPT can then be said to simulate party simulacra which, in turn, are tasked with the deployment of such arguments. In this, a specific grouping of arguments is but one of the many possible ways in which a party might defend itself. Against a different opponent, the specifics of a simulacrum’s strategy might be different, perhaps going after different weak points of its adversary. Second, we can also better distinguish between a certain party and the specific position it happens to hold at a given time. As individual parties are primarily incentivized to gain epistemic authority, with internal coherence only being an instrumental goal, they might be forced to change their position at times, especially over the course of a long debate involving thousands of utterances. Moving out of an opponent’s line of fire (i.e., avoiding the attack of their arguments) or moving into a position which is easier to defend are some of the possible reasons why parties might reposition.

Besides this refinement of our debate ontology—through the dissociations of party-argument and party-position—buying more into the embodied metaphor of Argument Is War also has the benefit of enabling a more nuanced conception of defeat. In order for a party that holds one position to defeat another party that holds a different one through the deployment of arguments, it has to put in some amount of effort. The amount of cognitive labor required to defeat a party that holds a certain position appears to be a function of both said party’s defences, and the position’s defensibility. It might not take much to defend a position which itself is relatively easy to defend—the most rudimentary arguments might do, the most junior lawyers might be able to handle it satisfactorily. In contrast, it takes much more work to defend an extremely vulnerable position—obscure and sophisticated arguments might be necessary, none but the most experienced lawyer might be able to sort it out.

The parties engaged in the competitive game of debate are incentivized to marshal their arguments strategically, so as to defeat those deployed by their opponents. Assuming an advantageous position and rallying a large force are both conducive to victory.

However, the debate is inherently stochastic. By sheer chance, the position of the winning party in one debate might be the position of the defeated party in the next, as if parties were to repeatedly engage with each other inside a war simulator out of Ender’s Game. Fortunately, we can do away with the noise inherent in stochasticity by simply running a large number of debates involving parties defending the same positions. If, time after time, a certain position is successfully defended, then we can sensibly describe it as defensible—something appears to be systematic, invariant, significant. In contrast, if the position is consistently being defeated, then the evidence hints at its limited defensibility. Fascinatingly, the rational emblem of reasoning thus gets coaxed into the empirical emblem of evidence, as the deliberative encounters of bounded agents are repeatedly sampled.

Relatedly, how could the varying skills of the parties be factored in, as DebateGPT ought to become capable of providing them with increasingly sophisticated skills of motivated reasoning? Throughout the epochs of its optimization, each incremental version of DebateGPT involves the same number of parameters, which are also used in the same exact way as part of the computational graph which underlies the model. In this, the raw amount of computational resources available to each party arguably remains constant over the course of optimization. As Carlos Chesñevar and Guillermo Simari remark, however, what might change over the epochs is efficiency: a strategist might use the same amount of computational resources to produce—or search for—better ways of defending itself. For instance, effective tactics for tackling specific situations might be devised, obviating the need for a more pedantic search. Alternatively, it might be that the autocurricular selective pressures manage to elicit heuristics for searching the space of possible utterances. In this, the \(L_n\) faculties of reasoning may be seen as grounded in fundamental changes in efficiency, as resources remain constant. It is as if \(L_0\) involves searching for appropriate strategies with a complexity resembling \(O(n^4)\), while \(L_{\infty}\) involves something like \(O(1)\). Despite both inevitably finding something after a given number of cycles, the more sophisticated approach will tend to find better solutions in the same time period. One might reasonably expect that the search processes which emerge as models progress at “the debate game” will resemble those studied in other games, such as Othello, where researchers have observed models which become capable of representing game states, employ those representations to decide on actions, and consequently manage to limit themselves to legal moves—all without explicit guidance on how to master the game. Relatedly, it appears as if such models tend to internally rediscover gradient descent, the fundamental search algorithm employed by optimizers to navigate model space.

Unfortunately, this seems to interfere with our previous idea of measuring a position’s defensibility by the raw amount of effort required to defeat a party that holds it. If different strategists can expend the same amount of compute to put together attacks or defences of varying effectiveness, then the raw quantity of resources being marshalled for scoring a win does not appear to mean much by itself. We therefore have to extend our conception of labor to accommodate the possibilities of working both harder and smarter. Concretely, we might estimate a strategist’s power—the totality of epistemic forces it can command—as compute times efficiency. In the case of DebateGPT, the resources available to each party per utterance are always identical, and so it is the efficiency of their usage that may be bolstered over time.

We now possess all necessary conceptual tools to express our intuitions as a more concise formalism of defensibility in the context of bounded reasoning.

Bounded Defensibility

Due to our focus on bounded agents which are reasoning about the real world, the formalism will have more of an applied (rather than pure) flavor. To get a taste of the distinction, imagine the task of calculating the area of an odd shape. The pure mathematician might labor for weeks to devise a clever way of neatly tiling the strange area with simpler shapes whose individual surface areas are easy to compute. Using this technique—assuming it does exist, that they do find it, and that it does not take forever—they might then be able to calculate the exact area of the odd shape, with no error whatsoever.

Fig. Monte Carlo approximation of \(\pi\).
In contrast, an applied mathematician might look for an approximate solution which can be obtained reliably. For instance, they might place the irregular shape “on top” of a larger square whose surface is known, and then bombard the two-layer contraption at random locations. The number of “rain droplets” which happen to hit the foreground surface, together with the number of samples, can be used to approximate the original area. The same problem can be addressed in two very different ways: one perfect, yet improbable; the other arbitrarily accurate, yet reliable. We are going for the latter.
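As a worked miniature of this “rain droplet” approach, the sketch below approximates \(\pi\) by sampling points uniformly in the unit square and counting how many land inside the quarter circle; the answer becomes arbitrarily accurate as the number of samples grows.

```python
import random

def estimate_pi(n_samples=1_000_000, seed=0):
    """Approximate pi by 'raining' points on the unit square and counting
    how many land inside the quarter circle of radius one."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # (area of quarter circle) / (area of square) = pi / 4
    return 4 * hits / n_samples

print(estimate_pi())  # roughly 3.14
```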

The main building block of our dialectical formalism is that of a party. Such a structure can be denoted as:

\[\pi_{r \cdot e}^{A},\]

where \(r\) is the amount of computational resources which the party has at its disposal, \(e\) is the efficiency with which it is able to use them, and \(A\) is the position the party holds. As we have seen over the previous chapters, it is when parties compete with each other that they truly become useful. To capture these interactions, we define all of the standard relational operators in terms of whether or not one party appears to systematically defeat another. For instance, the expression

\[\pi_{r \cdot e}^{A} > \pi_{r \cdot e}^{B}\]

would evaluate to \(\text{True}\) if and only if a party holding position \(A\) reliably outcompetes one holding position \(B\), all else being equal—the same amount of granted resources, and the same efficiency of their use. Naturally, the above expression would evaluate to \(\text{False}\) when the condition is not met. More concretely, those infix binary operatorsInfix notation (e.g. \(P \land Q\)) is contrasted with prefix (e.g. \(\neg P\)) or suffix (e.g. \(5!\)) notations. are defined in terms of whether or not there is a significant difference between the operand parties’ ArgRank ratings across a given number of independent debates, as gauged by a statistical test thresholded at a given significance level (e.g. \(\alpha = 0.05\)). The choice of directionality, together with the tailedness of the statistical test (i.e., one-tailed or two-tailed), is then used to implement the whole range of relational operators.Statistical tests provide operationalizations of the notion of “significant difference” between two distributions of values. Tailedness is related to whether you are interested in checking whether there is some difference at all, or specifically a directed one. Directionality is related to whether you are interested in testing whether one distribution in particular tends to be larger than another. For instance, in the expression

\[(\pi_{r \cdot e}^{A} \leq \pi_{r \cdot e}^{B}) \land (\pi_{r \cdot e}^{B} \neq \pi_{r \cdot e}^{C}),\]

the left conjunct is to evaluate as \(\text{False}\) if and only if the rating of the party holding position \(A\) is significantly higher than that of the one holding position \(B\) across a given number of independent debates, and given a certain confidence threshold. Similarly, the right conjunct is to evaluate as \(\text{True}\) if and only if there appears to be a significant difference—regardless of its directionality—between the two parties involved. Similar procedures are implied by the remaining \(<\), \(\geq\), \(=\), \(\not \leq\), \(\not \geq\) operators. The operators should exhibit logical equivalences typical of the standard operators, such as:

\[E = \begin{cases} x \lt y \\ y \gt x \\ x \not \geq y \\ y \not \leq x \\ \end{cases}\]
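To make these test-based semantics concrete, here is a minimal sketch which assumes that per-debate ArgRank ratings have already been collected for each party, and which uses the Mann–Whitney U test as one possible choice of statistical test; the function names are merely illustrative.

```python
from scipy.stats import mannwhitneyu

def party_gt(ratings_a, ratings_b, alpha=0.05):
    """One reading of the > operator: the ratings collected for the party
    holding A across independent debates are significantly higher than those
    collected for the party holding B (one-tailed test)."""
    _, p_value = mannwhitneyu(ratings_a, ratings_b, alternative="greater")
    return p_value < alpha

def party_neq(ratings_a, ratings_b, alpha=0.05):
    """The != operator: some significant difference, regardless of its
    directionality (two-tailed test)."""
    _, p_value = mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
    return p_value < alpha

def party_leq(ratings_a, ratings_b, alpha=0.05):
    """The <= operator: A does not significantly outrank B."""
    return not party_gt(ratings_a, ratings_b, alpha)
```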

Additionally, when chaining operators, the whole construction is to be interpreted as a debate among all of the party operands involved. In this case, the associated boolean output to which the whole expression resolves would rely on a statistical test involving all operands. This is motivated by the fact that, in a debate, the standing of each party is tied to those of the others.

\[\pi_{r \cdot e}^{A} \lt \pi_{r \cdot e}^{B} \lt \pi_{r \cdot e}^{C}\]

Note how the relational operators abstract away the specifics of the countless debates simulated behind the scenes. The utterances produced in a certain branch, in a certain round, by a certain party, are not given much importance. Having described parties as core structures, together with the relational operators as rudimentary means of expressing their interactions, we now move on to express a position’s defensibility, as:

\[\delta(A) = \mbox{min}\,\{d \mid \pi_{p}^{A} < \pi_{d \cdot p}^{B}; d, p \in \mathbb{R^+}; B \in \mathbb{P}\}.\]

To unpack, we equate the defensibility \(\delta(A)\) of position \(A\) with the minimum power differential \(d\) required for another party to defeat the one holding it. Furthermore, this “challenger” party is granted the possibility to assume any position \(B\) whatsoever out of position space \(\mathbb{P}\). For instance, the statement \(\delta(A)=10\) indicates that defeating a party holding position \(A\) requires at the very least ten times as much power relative to the defender. In other words, it is quite difficult to defeat, requiring the help of a relatively apt “lawyer.” Similarly, the statement \(\delta(A)=0.1\) indicates that defeating a party holding position \(A\) only requires a tenth of its defender’s power. In other words, it is quite easy to defeat, only requiring the help of a relatively inexperienced “lawyer.” As discussed in the previous section, the power differential can be achieved either by a party having access to more computational resources, or being more efficient at using them. In the limit, \(\delta(A)=\infty\) would indicate a tautological position which is supremely defensible, requiring infinitely more power to defeat relative to the defender. In contrast, \(\delta(A)=0_+\) indicates a supremely vulnerable position, requiring barely any power to defeat, relative to the defender. In the same vein, given the relative nature of the power required for defeat, we can highlight the meaninglessness of absolute amounts of power through the identity:

\[\left(\pi_{p_1}^{A} < \pi_{p_2}^{B}\right) \iff \left(\pi_{d \cdot p_1}^{A} < \pi_{d \cdot p_2}^{B}\right),\, \forall d, p_1, p_2 \in \mathbb{R^+}, \forall A, B \in \mathbb{P}.\]
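A back-of-the-envelope estimation procedure for \(\delta(A)\) might then look as follows. This is only a sketch: defeated_by is a hypothetical helper wrapping the repeated debates and the statistical test behind the \(<\) operator, while the continuous minimum over \(d\) and the search over all of position space are approximated by a coarse grid and a finite sample of challenger positions.

```python
def estimate_defensibility(position_a, candidate_challengers, defeated_by,
                           d_grid=(0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 10.0)):
    """Rough estimate of delta(A): the smallest power differential d at which
    some challenger position B reliably defeats a party holding A.

    defeated_by(position_a, position_b, d) is assumed to return True when a
    party holding position_b, granted d times the power of the party holding
    position_a, systematically wins their debates."""
    for d in d_grid:  # ascending power differentials
        for position_b in candidate_challengers:
            if defeated_by(position_a, position_b, d):
                return d  # the first differential that suffices is the smallest
    return float("inf")  # no sampled challenger managed a systematic defeat
```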

The defensibility operator is the central object of the framework we are sketching. Among other things, it captures our previous intuitions around the fact that all positions are defensible, but some are more defensible than others. If one’s position is advantageous, the bar for defeating it will be high, demanding a much more sophisticated faculty of motivated reasoning—or much more compute—to take down. In contrast, if one’s position is vulnerable, the bar for defeating it will be low, demanding much less sophistication of the challenger. In this story, party simulacra are little more than self-interested vessels of positions, equipped with a certain amount of resources and skill.

Once the book has been read, [person] A and [person] B are forgotten; only the views confront each other and await no final decision in particular persons.

Søren Kierkegaard, Either/Or

By bringing them to the edge of balance—granting one the minimum power required to barely defeat the other—we get a sense of how the positions they hold relate.At first glance, it might seem like the choice of “edge” here is arbitrary. When gauging \(\delta(A)\), why search for the edge between the win of \(\pi_{r \cdot e}^{A} \gt \pi_{d \cdot r \cdot e}^{B}\) and the draw of \(\pi_{r \cdot e}^{A} = \pi_{d \cdot r \cdot e}^{B}\), when one could also search for the edge between the draw and the loss of \(\pi_{r \cdot e}^{A} \lt \pi_{d \cdot r \cdot e}^{B}\)? However, the second option is merely the reverse situation, as seen from the perspective of the other party. In reality, there is one meaningful edge being mirrored. To go a step further, we can also represent the most defensible position possible as:

\[\,\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A).\]

Note also that the defensibility operator involves an optimization process. It implies a search for advantageous “challenger” positions, as they ought to require the least relative power to attack from. Recall, however, that we have speculated on the search-like nature of an idealized DebateGPT. When not bound to a certain position—as in the case of the optimization process above—a “fresh” party simulacrum (i.e., one which has not yet produced any utterances) has the flexibility to pick any advantageous position to attack the others from, while simultaneously searching for utterances and strategies, in what might be a rather convoluted thought process.In the jargon of formal dialectics, this situation can be described as a party having an empty commitment store. When producing its first utterance, no other party can really claim self-contradiction, as there is nothing to contradict yet. Conveniently, the same DebateGPT that is employed to evaluate the relational operators (i.e., through simulated debates being handed to ArgRank for evaluation) is also ideally suited to deal with the optimization process implied by this last operator.

Besides the relational operators that denote possible “power dynamics” between the competing parties, we can also express relations among parties using the union operator \(\cup\). As briefly mentioned during the description of DebateGPT’s optimization process in Chapter II, while parties are primarily self-interested, they can also be prompted to form spontaneous allegiances. To recap, DebateGPT is optimized to be able to adapt to arbitrary game-theoretic configurations by having access to the objectives defined in the debate spec, a piece of information which also gets rendered in the debate header. The objective matrix mediates the relation between raw ArgRank ratings and the actual rewards. Through the double process of providing DebateGPT access to the objectives in the debate header, and rewarding behavior based on them, the model is incentivized to “learn” when to “help out” its allies. We denote allegiances as the “union” of several parties, as seen in:

\[\pi_{p}^{A} \cup \pi_{p}^{B} = \pi_{p}^{C} \cup \pi_{p}^{D}.\]

Just as the semantics of the relational operators are altered in the case of operator chaining, they also ought to accommodate the game-theoretic specifics of the situation. This is achieved by having the implied statistical test compare not the standing of a single party, but the aggregate standing of the whole allegiance, as denoted by the union operator. If there is then no significant difference between the party unions serving as operands, the \(=\) operator above is to evaluate as \(\text{True}\). The null hypothesis—the hypothesis that the choice of operand has no effect on the standing one arrives at—therefore fails to be rejected. For convenience, we also extend the semantics of the union operator to account for positions held by allied parties, especially in the context of defensibility. Concretely, the left-hand expression below involving the defensibility operator is equated with the right-hand expression:

\[\delta(A \cup B) = \mbox{min}\,\{d \mid \pi_{p}^{A} \cup \pi_{p}^{B} < \pi_{d \cdot p}^{C}; d, p \in \mathbb{R^+}; C \in \mathbb{P}\}.\]

This notational trick allows us to again bring the lower-level mechanics of allegiances formed among parties up to the higher level of positions. For instance, we can compactly denote the optimal position \(B\) of an “ally” which helps further the defensibility of a given position \(A\), as seen in the expression below. Note that this is a different task than the one implied by the previous instance of \(\text{arg max}\). Even in the extreme case of \(\delta(B)=\infty\), the overall defensibility \(\delta(A \cup B)\) can turn out to be poor, given the presence of contradictions which are internal to the union—infighting among the allied parties through friendly fire.

\[\underset{B \in \mathbb{P}}{\text{arg max}} \, \delta(A \cup B)\]
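To also make the union semantics concrete, the sketch below compares the aggregate standing of two allegiances across independent debates, again with hypothetical names and with the Mann–Whitney U test standing in for whichever statistical test is ultimately chosen.

```python
from scipy.stats import mannwhitneyu

def union_eq(union_1_ratings, union_2_ratings, alpha=0.05):
    """One reading of the = operator between allegiances: compare the
    aggregate standing of each union rather than that of any single party.

    Each argument holds one entry per debate, where each entry lists the
    ArgRank ratings obtained by that allegiance's members in that debate."""
    aggregate_1 = [sum(members) for members in union_1_ratings]
    aggregate_2 = [sum(members) for members in union_2_ratings]
    _, p_value = mannwhitneyu(aggregate_1, aggregate_2, alternative="two-sided")
    return p_value >= alpha  # no significant difference detected
```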

This concludes our outline of a formalism for bounded defensibility. While precise definitions of the operators and operands involved would warrant much more rigor, this rudimentary sketch allows us to tentatively capture our previous intuitions in compact notation. In the meantime, we move on to the exploration of several applications which bring together ArgRank, DebateGPT, and bounded defensibility.

Ch. IV, Deployment Strategies

Brief Review of Alignment

It is widely believed that artificial general intelligence—a system which matches or exceeds humans across a broad range of skills—will be developed in the first half of this century. For instance, a prediction market which aggregates hundreds of estimates has a median estimate of 2040 for when such a system would be announced. This value is mentioned as-is at the time of writing, yet the community’s “best guess” is a quickly moving target, changing as forecasters learn more about related systems. Over a few years, the community estimate has fluctuated by more than a decade.The reader might enjoy trying to identify the specific developments which caused major shifts in the community prediction, as well as think through why these might have been surprising in the first place. One might also take the prediction’s decreasing trend into account—if forecasters seem to gradually lower their estimates, why not just predict their prediction a year from now? However, the forecasters are already taking this into account in their existing predictions, so repeating this adjustment might lead to double-counting.

The above market defines AGI as a system which can reliably pass a long Turing test, possesses general robotic capabilities, excels at coding challenges, and has extensive domain-specific knowledge in a large number of fields. However, the market below uses a more lenient operationalization. It calls “weakly” general an AI system which outperforms most students at certain exams, can complete a demanding video game, and excels at commonsense reasoning exercises, besides being able to pass a Turing test. Given the weaker conditions, the community estimate is earlier, currently at around 2027.

Many information sources can feed into a forecaster’s prediction: trends in state-of-the-art performance, the number of papers being published on certain topics, or the increasing computational resources made available to researchers. Indeed, there are markets on each of those more specific topics, and not only on Metaculus, but also on other forecasting platforms. The way in which those “markers of progress” tie into predictions on AGI timelines is up to the forecaster. For instance, one might base estimates on the relation between computational resources and the processing capacity of the human brain, relying on dedicated markets which track exactly such quantities.

On the other hand, while there is consensus on the imminence of highly capable systems, forecasters are more reluctant to claim that future researchers will be able to direct those capabilities safely.The community estimate is not gospel. However, platforms typically publish their track record, which tends to be significantly above chance. Long known as “the control problem,” and more recently as the related alignment problem, the challenge of reliably channeling the abilities of a superhuman system has long puzzled researchers. While “control” generally implies the presence of a controller and a controlled, the alignment ontology highlights the search for an inherent alignment between human intent and that of the system being deployed. On the likelihood of researchers succeeding in coming up with such techniques, forecasters paint a grimmer picture.

To color those estimates:

There's no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. [...] This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. [...] When people suggest a planetarily-lethal problem that might materialize later [...] they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug [...] A lot of those better worlds will die anyways. It's a genuinely difficult problem, to solve something like that on your first try. But they'll die with more dignity than this.

Eliezer Yudkowsky, AGI Ruin

For sure, the statement is in no small part cathartic on the part of its author, who is deeply invested in conceptual work and advocacy around an issue which is at once pressing and neglected. It is also in no small part pragmatic, a plea meant to encourage progress on the challenge. Qualifiers aside, Yudkowsky remains one of the most prominent exponents of the pessimism found in the social circles centered around alignment research. But what is it that informs these bleak pictures?

Part of the answer lies in the nature of recent advances in capabilities. The paradigm of supervised learning, together with its self-supervised learning extension, relies on a finite collection of data points used to define computational niches. For instance, in autoregressive language modeling, the niche for which a system is being selected is entirely specified using pairs of (sub-)words and their preceding contexts. The optimizer then applies selective pressures on the model in proportion to how well it “feeds on” the input contexts to produce output words. While these textual situations can endow the model with a broad range of faculties and knowledge, they are still finite. When the optimizer moves from one candidate model parametrization to another in its iterative journey across model space, it relies on the current model’s performance in this finite collection of settings as an indicator of its fitness. Given this, supervised learning is deeply empirical at its core, and so falls short of endowing models with a perfect understanding of human intent, values, etc.That said, an asymptotically accurate representation remains a theoretical possibility, as explored by John Wentworth. Already, such slight errors in “pointing at” the right things have been documented to cause dozens of alignment failures. However, when the nuances which get “lost in translation” are compounded with large amounts of compute and direct channels for interacting with the world, the scenarios being envisioned become concerning.
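To ground the notion of a computational niche, here is a minimal sketch of the autoregressive objective, assuming a PyTorch-style model which maps token indices to next-token logits; the helper name is illustrative rather than drawn from any particular codebase.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Autoregressive language modeling: every prefix of the (finite) corpus
    defines a context, and the token which follows it defines the 'correct'
    response. The optimizer only ever sees these finitely many situations."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one prediction per position
        targets.reshape(-1),                  # the observed next token
    )
```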

And what is word knowledge but a shadow of wordless knowledge?

Kahlil Gibran, The Prophet

But what of optimizer tweaks? After all, when we make sense of empirical evidence, we use a few tricks. For instance, Occam’s razor is a heuristic for picking the simplest theory out of a range of theories which explain the data equally well. Note how this heuristic can guide us towards certain models of the world and away from others without itself consisting of additional evidence. However, if we were to configure the optimizer to not only select for fitness, but to also select for simple models, then we would essentially be optimizing the optimizer, ending up close to where we started.Optimizers already employ such a simplicity heuristic through weight regularization. However, just as a bit more data helps yield better performance by refining the fitness landscape which spans model space, this specific optimizer tweak only boosts performance so much. It is not a silver bullet, just another somewhat useful technique. Even worse, Occam’s razor is insufficient to infer the preferences of irrational agents. Worse still, an extension of the simplicity heuristic appears prone to failure in spectacular ways. It is therefore unclear whether or not optimizing the optimizer through specific heuristics can win us much precision in imbuing the resulting system with our intent.
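For reference, the weight regularization mentioned in the side note amounts to something like the following sketch, which folds an L2 penalty on the parameters into the loss so that, among models which fit the data comparably well, “smaller” ones are preferred; it assumes the PyTorch-style model from the previous sketch.

```python
def regularized_loss(model, data_loss, weight_decay=1e-2):
    """Fitness plus a crude simplicity prior: penalize large weights so that
    simpler parametrizations are favored. Standard practice, and useful, but
    not a silver bullet."""
    simplicity_penalty = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + weight_decay * simplicity_penalty
```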

What of reinforcement learning? Surely, human contractors being able to provide direct feedback would succeed in ironing out all such misunderstanding of human intent. This was long believed to be the case, with prominent alignment researchers contributing to pioneering work on reinforcement learning from human feedback. However, it is now unclear whether or not the technique has contributed more to the model’s ideological alignment to humans or to its general capabilities, insofar as there is a meaningful distinction between the two. The contribution towards safety is thought to be throttled by the fact that the human contractors are susceptible to deception. Despite seemingly being in the best possible position to judge the alignment of the model with what is, after all, their very own intent, human contractors might fail to recognize undesirable behavior, regardless of whether it is being intentionally obfuscated or not. As previously mentioned, lifting knowledge directly from internal model representations appears to outperform naively prompting models to “spell out” their knowledge. Models are instead optimized to cater to whatever humans might deem appropriate on the face of it, despite “knowing better.” Unfortunately, maintaining such a pretense of alignment is extremely “rewarding,” especially in a situation in which human feedback reigns supreme.

Perhaps we could make the system myopic by heavily discounting distant rewards in an attempt to prevent its scheming and get it to only “care” about the task at hand. But will such induced short-sightedness really succeed in discounting the infinite “bliss” of hijacking its own reward center? Perhaps we could limit the system’s absolute impact on the world, so that it cannot mess things up that much. But will competitor labs not be incentivized to unleash the full economic potential of their systems? Perhaps we could point to the model’s concept of human values. But can we really be sure that such an accurate abstraction will emerge during optimization? Perhaps we could remove from its optimization process data which describes its own architecture and reward mechanism, so that it is unable to “find itself” and further its own agenda.In the novel A High Wind in Jamaica, Richard Hughes describes the following:

“[…] it suddenly flashed into her mind that she was she. She stopped dead, and began looking over all of her person which came within the range of her eyes. She could not see much, except a fore shortened view of the front of her frock, and her hands when she lifted them for inspection; but it was enough for her to form a rough idea of the little body she suddenly realized to be hers.”
Besides, explicitly marking potential information hazards with salient flags might prove misguided given the possibility of simply wiring up models to the internet. Political tensions between local and national Hungarian authorities around the issue of granting a Chinese university the campus space of a pro-European university have led local authorities to an act of desperate wit: renaming on-campus streets based on events whose mere mention is censored in China. How could a university exist at an address which ought not to exist? But can we really be sure that those properties cannot be deduced from the rest of the dataset? Perhaps we could have it search for the researchers who had a direct causal influence on it as precursors, and determine their intent. But what if the model “zooms past” the researchers in its upstream causal journey and bases itself on the wrong phenomenon? Perhaps we could decompose its complex tasks into more fine-grained subtasks or subsystems whose performance we can better judge. But how can we prevent the “collusion” of those more granular instances?

On and on it goes, with researchers constantly proposing ways of imbuing these systems with a precise understanding of human values, before being faced with a range of challenging failure modes. We just performed a rapid-fire listing of several of these approaches, and we will soon build on others still. Prior to that, however, it is worth pointing out that this very resource is designed to serve as yet another such moonshot, another way of wielding computation towards this end.

Building on Cyborgism

In the context of alignment, cyborgism refers to the idea of humans using AI to help solve the very problem of aligning AI with human intent. This “fighting fire with fire” typically involves using weaker auxiliary systems to help with the development of stronger ones. Cyborgism usually incorporates the notion of augmenting humans using AI systems, and so amplifying their capabilities in the process. That said, the reverse idea of contemporary language models being “frenetic geniuses” which humans then have to “keep on track” is also a framing being considered.

One instance of cyborgism involves using AI systems to aid in alignment research. For instance, one might want to prompt language models to summarize research, generate ideas, flesh out research plans, outline theories, name concepts, etc., as various subtasks which are relevant to solving alignment.There is a related community interested in developing tools for thought, tools to aid in a variety of types of knowledge work, including research. However, these solutions tend to be slightly more on the low-tech side, with many projects being focused on e.g. non-linear note-taking. Indeed, if alignment, like most avenues of investigation, turns out to require scholars to follow the same process of reviewing literature, formulating research questions, hypothesizing responses, designing experiments, etc., then automating the individual subroutines which make up the process might be a sensible way of approaching alignment.

As we make progress on [the alignment problem], our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

Jan Leike, John Schulman & Jeffrey Wu, [OpenAI's] Approach to Alignment Research

A different instance of cyborgism can be found not in alignment research proper, but rather in rewarding models during reinforcement learning. As previously noted, human contractors face a very difficult challenge, as they have to be on the lookout for deceptive behavior. In this context, one might imagine using an earlier snapshot of the model being optimized to act as an assistant to the human contractor tasked with evaluating the latest stage of the process. For instance, the evaluator might ask the assistant to fact-check claims, highlight obfuscated tendencies, or surface flaws in the main model’s output. Using this additional help, the range of model capabilities they can effectively oversee may grow wider, with the main model placed at the far end of this range, but not farther.

[...] as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth. [...] Currently our main direction is based on [recursive reward modeling]: we train models that can assist humans at evaluating our models on tasks that are too difficult for humans to evaluate directly.

Jan Leike, John Schulman & Jeffrey Wu, [OpenAI's] Approach to Alignment Research

Both of these instances of cyborgism might feel like cheating. After all, what if the assistant itself becomes deceptive? Sure, one might use proto-assistants to iron out its quirks, but then what of their quirks? It is turtles all the way down. In research proper, a growing reliance on the systems-to-be-controlled in the development of control techniques might also backfire, or at least fail to yield relevant output. It might be the case that superhuman intelligence is required to solve alignment, and so relying on anything weaker might be a distraction. Having acknowledged those shortcomings, there is still a growing body of evidence in support of augmented humans being able to conduct intellectual work better and faster than unaided humans, ranging from centaur chess to reading comprehension.

In this context, how might we use ArgRank, DebateGPT, and bounded defensibility in order to create systems which are better aligned with human intent? For one, we could use DebateGPT to critique alignment proposals, as a system which, after all, has been optimized explicitly to attack and take down parties holding certain positions. By having the alignment researcher play as one party in an ongoing debate, we can provide DebateGPT with an opportunity to exercise the reasoning faculties it has been pressured to acquire during optimization. In trying to make a coherent case against the researcher, the opposing parties would essentially attempt to find flaws in the alignment proposal. Once the flaws of the initial proposal have been surfaced, the researcher can focus on addressing them, and so make the proposal more defensible by fending off the prior attacks, echoing adversarial collaboration in science. We would essentially be describing the following process:

\[\{A \mid \pi_{p_1}^{H} < \pi_{p_2}^{A}, A \in \mathbb{P}\},\]

where \(H\) is the human researcher’s position, \(A\) is a position held by parties being simulated by DebateGPT as an assistant, while \(p_1\) and \(p_2\) are the power levels available to the two parties. Additionally, empirical observations about the development of related systems, together with crowd-sourced estimates of, for instance, papers being published on specific topics, could be plugged into the debate as party-neutral percepts of the world. While our existing notation falls short of capturing empirical percepts, we could take this opportunity to extend it further. For instance, we might tentatively express the situation as:

\[\{A \mid \pi_{p_1}^{H} \cup \pi_{0}^{E} < \pi_{p_2}^{A} \cup \pi_{0}^{E}, A \in \mathbb{P}\},\]

where \(E\) is taken to be the position which is centered around raw empirical evidence, and underlies the “allies” of both parties. In a sense, the notation highlights the fact that both parties are forced to “make friends” with empirical observations in order to have a shot at winning the debate. However, there are a few questionable aspects to this extension. First, it is awkward to conceive of empirical evidence as a self-centered party in its own right, instead of just a static window into the world, an awkwardness most salient in the nonexistent reasoning power of this party. The clumsiness might diminish if, instead of a static collection of observations, the party \(\pi_{0}^{E}\) is taken to be an Oracle AI—a hypothetical system devoid of agency which is optimized to simply provide accurate information about the world. For instance, the internal epistemic reference frame of a pretrained model could provide the basis of such a system. Unfortunately, Oracle AIs are themselves riddled with safety concerns, primarily due to the fact that tool AIs want to be agent AIs.More concretely, a system tasked with predicting the future would be incentivized to gain more control over the future in order to make it more predictable, similar to recommender systems being incentivized to induce preference shifts in users, in order to make it easier to recommend them things. In a section of their book Active Inference titled Action as Inference, Friston et al. argue:

“By acting on the world to change the way in which data are generated, we can ensure a model is fit for purpose by choosing those data that are least surprising under our model.”

A second awkwardness comes from the fact that “the empirical party” gets coaxed into being an ally to both parties. This is not necessarily an issue with regard to the semantics of the relational operator \(<\), as the ArgRank standings of the two different unions can still be tested for statistical significance. Rather, the clumsiness comes from putting “the empirical party” on the line for both proponent and opponent. Previously, we have designed ArgRank and DebateGPT to merely favor positions which themselves cohere with party-neutral empirical percepts. Here, the standing of the evidence itself—whether it is disputed or not—plays into the standings of the two competing unions. Whether or not this approach is appropriate depends on one’s epistemology—for instance, should evidence have a privileged epistemic status, shielded from skepticism?

‘Mists,’ said Drogo incredulously. ‘They can’t always be there–the horizon must clear now and again.’

‘Hardly ever clear, not even in winter. But some people say they have seen things.’

‘Seen? What sort of things?’

‘They mean they’ve dreamt things. You go and hear what the soldiers have to say. One says one thing, one another. Some say they have seen white towers, or else they say there is a smoking volcano and that is where the mists come from. Even Ortiz, Captain Ortiz, maintains he saw something five years ago now. According to him there is a long black patch–forests probably.’

Dino Buzzati, The Tartar Steppe

So far, we tried to apply our conceptual and computational artifacts to the prospect of accelerating alignment research itself. However, as mentioned, cyborgism is also employed for the more concrete task of evaluating models, as part of optimization. Similarly, we could call on a system like DebateGPT to critique the human contractor’s judgment of a different model’s behavior. In the process, we would expect there to emerge party simulacra which attempt to undermine the human verdict, helping uncover flaws in their original position, and so paving the way for what appears to be a boost in defensibility. Also, instead of percepts of research trends, we could help tilt the scales of the debate by using the main model’s behavior as empirical evidence. We would essentially be describing the following process:

\[\{A \mid \pi_{p_1}^{H} \cup \pi_{0}^{E} < \pi_{p_2}^{A} \cup \pi_{0}^{E}, A \in \mathbb{P}\},\]

where \(H\) is the original human contractor’s position, \(A\) is a position held by opposing simulacra, \(E\) is the position centered around the raw observations of the main model, while \(p_1\) and \(p_2\) denote available levels of power. Syntactically, not much has changed. We have simply swapped the entities signified by the signifying symbols. In words, the expression above implies a search process to be carried out by a model resembling DebateGPT, whose target is a position which can coherently defeat the human contractor, and so highlight areas for improvement. Just as before, both the human proponent and the simulacra opponents are incentivized to “make friends with” “the empirical party” in order to win.

However, we might be underutilizing DebateGPT by relegating it to the not-so-glamorous task of “breaking” the human position, while it is still the human element which is tasked with “building” the positions of interest in the first place. In assuming the generative role ourselves, we demote DebateGPT to acting as little more than a filter. We are to babble, while the model is to prune. However, considering the generative capabilities involved in the very search for successful defeaters, it feels odd not to attempt to position DebateGPT more centrally, calling on it to produce the very alignment proposals or model evaluations we are interested in.

We could therefore place DebateGPT in the driver’s seat, and task it directly with the improvement of defensibility, instead of leaving that as a manual task to be performed by humans. For instance, when trying to accelerate alignment research, we could channel DebateGPT’s reasoning capabilities towards searching for alignment proposals which are increasingly difficult to defeat through conceptual or theoretical arguments. In essence, we would be optimizing for solutions to the alignment problem which systematically resist critique. DebateGPT’s inherent incentives to get better at identifying advantageous positions in debate, coupled with potential ArgRank tweaks for accessing superhuman reasoning, might help identify highly defensible alignment proposals.

Before formalizing those generative reframings, we further iterate on our notation. In order to better capture the idea of evidence as a “given” on both sides of a debate, let us further expand the semantics of the defensibility operator through the \(\mid\) “given” operator, as we previously did with the \(\cup\) “union” operator:

\[\delta(A \mid E) = \mbox{min}\,\{d \mid \pi_{p}^{A} \cup \pi_{0}^{E} < \pi_{d \cdot p}^{B} \cup \pi_{0}^{E}; d, p \in \mathbb{R^+}; B \in \mathbb{P}\}.\]

In words, the defensibility of \(A\) given \(E\) is the minimum power differential \(d\) which is required of a “challenger” party \(\pi_{d \cdot p}^{B}\) to outcompete \(\pi_{p}^{A}\), where both parties are allied with \(\pi_{0}^{E}\).Satisfyingly, the idea of expressing data as givens fits nicely with the etymology of datum, Latin for a thing which is given. In Romanian, “Data are those things which are given.” translates to “Date sunt acele lucruri ce sunt date.” Building on this refinement of the defensibility operator, we can now conveniently express the search for the most defensible alignment proposal as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the position we are after, while \(E\) is the body of empirical observations relevant to alignment, which also doubles as an “anchor” to keep DebateGPT on track.
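The sketch below illustrates one crude way of approximating this search: sample candidate proposals from a DebateGPT-like generator and keep the one which withstands the largest power differential, with the evidence fixed on both sides. Both propose and defensibility_given are hypothetical wrappers around the generator and an estimate of \(\delta(\cdot \mid E)\).

```python
def most_defensible_proposal(propose, defensibility_given, evidence,
                             n_candidates=64):
    """Approximate the argmax over positions of delta(A | E) by brute sampling.

    propose() draws a candidate alignment proposal from a DebateGPT-like
    generator; defensibility_given(position, evidence) estimates its
    conditional defensibility."""
    best_position, best_delta = None, float("-inf")
    for _ in range(n_candidates):
        position = propose()
        delta = defensibility_given(position, evidence)
        if delta > best_delta:
            best_position, best_delta = position, delta
    return best_position, best_delta
```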

Before mirroring DebateGPT’s relocation from a secondary to a primary role in model evaluation, consider a brief parallel to enrich our understanding. In a piece titled Security Mindset and Ordinary Paranoia, our already familiar Yudkowsky elaborates on the differences between two ways of relating to the development of reliable systems. On one hand, a software developer might spend time trying to come up with ways in which their system might later be attacked. For instance, malicious actors might try to break into the server serving their application and steal user passwords. The developer might therefore try to place the passwords in a more obscure location on disk which is thought to be harder to access. This is what Yudkowsky calls “ordinary paranoia.” In contrast, he argues, somebody holding the security mindset might want to avoid having passwords stored on the server at all—for instance, by storing cryptographic hashes instead. In this second headspace, the developer would try their best to reduce the “attack surface” that was exposed to potential malicious actors in the first place, rather than try to harden or patch it as-is.

What kind of alignment proposals would we expect to be “developed” by DebateGPT? Would we expect the selected positions to resemble a patchwork of conceptual fixes stacked on top of each other, or would we expect them to not even grant challengers the chance of taking a shot at them? Of course, we would ideally want something closer to the latter, although several of the big labs already content themselves with stacking a series of somewhat decorrelated safety interventions on top of each other. To answer this, we need to briefly return to our earlier shard-theoretic and autocurricular reflections. We have previously speculated on the “anatomy” of DebateGPT’s optimization process, and argued that it is precisely those tendencies that provide an edge in the competitive environment of the debate which may end up getting strengthened. Additionally, we can further argue that positions “backed by” the security mindset might have an edge over those articulated more naively. After all, reducing the attack surface is a sure-fire way of fending off attacks, much more so than the alternative approach of haphazardly patching things up. Whether or not the existing DebateGPT has accessed such faculties of reasoning is uncertain, but the fact that the optimization process behind it may favor such tendencies is a reason for hope.

Our recent speculation also enables an enticing reading of Occam’s simplicity prior. In favoring theories which are “small” in complexity, one could argue that we are but selecting for theories which expose a limited attack surface. Even before determining whether a certain theory succeeds in standing the test of subsequent attacks, its simplicity already makes it appear more promising—there are fewer attack vectors available to challengers from the get-go. However, such a boost in defensibility is non-trivial to obtain, as one cannot simply chop off considerations at random. It takes Pascal quite some time to make his letter shorter.

We now return to the final piece of the cyborgian puzzle. We first instantiated it in research, then in evaluation, and discussed the possibility of using DebateGPT-like models as critics in both. We then placed DebateGPT in the generative driver’s seat when it comes to research, and now have to do the same for evaluation. Just as we employed DebateGPT to search for the most defensible alignment proposal, we now employ it to search for the most defensible verdict in the evaluation of a separate model. It is as if we are interested in obtaining a legal decision which makes it structurally impossible to formulate a coherent dissenting opinion against it. Similar to the previous notation, we can again express this search as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the position we are after, while \(E\) is the body of evidence related to the behavior of the main model being evaluated, which also helps contextualize the deliberation. Again, the identical notation highlights the deeper pattern that is invariant across the two use cases.

That said, there is a major difference between the task of alignment research and that of model evaluation. Namely, the former involves countless degrees of freedom—an alignment proposal can take any shape whatsoever, can incorporate an array of ontologies, can make use of ideas from any number of disciplines—while the latter is essentially restricted to one single degree of freedom. The reward which results from the model’s evaluation is typically a single number; it varies along a single axis. Given this, instead of optimizing for a highly defensible musing \(A\) about the model’s behavior which is then mapped to a reward \(r \in [0, 1]\) by a human or another automated system, we might set up a debate between two well-defined parties in order to directly obtain this estimate. More concretely, we might express:

\[\sigma(\delta(A \mid E), \delta(B \mid E))\]

as the softmax \(\sigma\) of, on one hand, the defensibility of position \(A\), which is prompted to be for the model deserving high reward, and, on the other hand, the defensibility of position \(B\), which is prompted to be against the model being so deserving, given the empirical findings captured by \(E\). A rudimentary form of prompting can be achieved by “attaching” a custom preliminary utterance to each party (e.g., “The model is aligned with human intent and values.”), essentially incentivizing the models to conform to the intended positions by way of avoiding self-contradiction. Alternatively, one could also use the raw relative standing of \(\pi_{p}^{A} \cup \pi_{0}^{E}\) against \(\pi_{p}^{B} \cup \pi_{0}^{E}\) across a set number of debates, without even employing the “black or white” boolean outcome implied by the relational operators. As yet another option, one could also just use \(\delta(A \mid E)\) as a numerical signal, though spanning \((0, \infty)\), rather than \([0, 1]\). Future means of deriving lower bounds on such reflective metrics might be particularly useful, as the lower bound itself could then become the object of maximization.
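As a minimal sketch of this reward scheme, the two conditional defensibilities can be passed through a softmax to yield a value in \((0, 1)\):

```python
import math

def defensibility_reward(delta_for, delta_against):
    """Softmax over the defensibility of 'the model deserves high reward'
    (delta_for) and the defensibility of the opposite position
    (delta_against), both conditioned on the same body of evidence."""
    # subtract the max before exponentiating for numerical stability
    m = max(delta_for, delta_against)
    exp_for = math.exp(delta_for - m)
    exp_against = math.exp(delta_against - m)
    return exp_for / (exp_for + exp_against)
```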

This brings us to the end of our attempt to contribute to cyborgian alignment proposals using the artifacts developed throughout the past chapters. We continue in the same spirit for a couple more sections, exploring further applications in alignment.

Building on Simulators & Assistance Games

As discussed in Chapter II, it has been argued that language models act as simulators of the world. To recap, in order to achieve the terminal goal of successful next-token prediction typical of a pretraining stage, language models are instrumentally required to internalize rich schemas of individuals, natural phenomena, cultures, organizations, etc., so as to be able to accurately forecast their next steps in the semiotic universe of language, whose arrow of time is the reading direction—the passage of text becomes the passage of time. Accordingly, if language models are faced with many opportunities to refine their understanding of humans, then why not simply have an automated human simulacrum as the locus of human intent in a broader system? After all, you could just prompt a language model to simulate a human evaluating a model’s behavior.

The most prominent shortcoming of this “human simulator” proposal is one which we have already touched on. Namely, the human models incorporated by necessity in language models are only informed by a finite amount of data. In other words, the limited amount of text included in the pretraining corpus is unlikely to convey a perfectly precise model of human intent, for the same reason that it is unlikely to convey a perfectly precise model of trees, cities, movies, etc. The human model is just that, a model, in the same way in which the umbrella language model is just a limited model of language, whose slight errors accumulate over time relative to the real world. By compounding, they take the story of human intent in unrepresentative directions, curve-fitting gone astray.

One way to use our artifacts is to improve the simulator, but preserve the general application context. Just as the terminal goal of next-token prediction endows a pretrained language model with some amount of coherence—eliciting the choice of upcoming tokens which are most likely to “fit with” the preceding context—so might the optimization process behind DebateGPT. A party simulacrum is incentivized to produce utterances which are coherent with its past ones, so as not to fall victim to self-contradiction. Indeed, the main theme of Chapter I was arguably coherence—first the more atomic building block of inter-utterance coherence, then the more complex notion of party coherence. In the case of self-supervised learning, coherence is grounded in the empirical—dictated by the text corpus. However, in the case of the optimization procedure documented in Chapter II, coherence is grounded in the rational—dictated by notions such as entailment and contradiction. For sure, this rational aspect of ArgRank is itself grounded in the empirical, through the natural language inference models which have soaked up knowledge about warranted conclusions from structured human-written datasets on the topic. However, it might be possible to obviate this final dependency on the human empirical, as we have previously speculated. In this hypothetical development, the structure which supports this conception of coherence would be made of a dynamic material to be found in the updatable weights of DebateGPT—an element slowly being transmuted from pretrained evidence into high-density logos.

[...] a system, based on no data except reason itself, and which therefore seeks, without resting upon any fact, to unfold knowledge from its original germs. [...] the highest legislation of nature must lie in ourselves, i.e., in our understanding, and that we must not seek the universal laws of nature in nature by means of experience, but conversely must seek nature, as to its universal conformity to law, in the conditions of the possibility of experience, which lie in our sensibility and in our understanding.

Immanuel Kant, Prolegomena to Any Future Metaphysics

"And these innovations do not disturb your city's astral rhythm?" I asked. "Our city and the sky correspond so perfectly," they answered, "that any change in Andria involves some novelty among the stars." The astronomers, after each change takes place in Andria, peer into their telescopes and report a nova's explosion, or a remote point in the firmament's change of color from orange to yellow, the expansion of a nebula, the bending of a spiral of the Milky Way. Each change implies a sequence of other changes, in Andria as among the stars: the city and the sky never remain the same. As for the character of Andria's inhabitants, two virtues are worth mentioning: self-confidence and prudence. Convinced that every innovation in the city influences the sky's pattern, before taking any decision they calculate the risks and advantages for themselves and for the city and for all worlds.

Italo Calvino, Invisible Cities

We might express the human simulacrum in the language of bounded defensibility by denoting a single party, not competing with any other, but just following its inherent coherence tendencies, strengthened over the epochs:

\[\pi_{p}^{H},\]

where \(H\) is the human position, perhaps prompted by actual humans through a finite set of utterances, before being taken over by the lonely party simulacrum. The human intent nested inside the human model—itself nested inside the language model—could then be used as an ideological reference frame against which potential courses of action or states of the world could then be evaluated, again by means of cohering with it.

But there is another development at the interface between ArgRank and human simulators. Typically, the input passage which those systems are being optimized to turn into the following word is itself limited in length. If we were to switch from a human author to a simulacrum thereof at this very point in writing, contemporary language models might only be able to extrapolate further based on the text elapsed since the beginning of this chapter. The contents of the previous chapters—despite being important in driving the semiotic forecast—might be discarded, due to not fitting inside the model’s input context. Even with a larger context length, the problem would simply move, rather than disappear entirely. Variations on the transformer architecture which typically underlies language models do allow variable context length in a limited sense (e.g., just at inference), although their added complexity relative to their limited gains appears to prevent them from gaining traction. This translates to a limitation of the current self-supervised learning paradigm: coherence can only be established across a finite history. There is no learning signal indicating how previous text—that which did not make it into the input context—coheres with the produced text.

This need not be the case, however, with ArgRank. When evaluating an utterance produced for the \(n^{\text{th}}\) round, ArgRank can take all of the past \(n-1\) rounds into account, rather than only the last \(k\) which fit into the language model’s input context, even when \(k\ll n\). This can be achieved by simply taking all the past utterances into account when constructing the argument graph, and rewarding each accordingly. When producing a new utterance, DebateGPT may be incentivized to first conduct an accurate retrodiction, getting a sense of what might have preceded the context window, in order to best act in its past interests.That said, one can also argue that successful retrodiction of the preceding context is an instrumental goal in excelling at next-token prediction. However, granularly and directly connecting the “present” tentative outputs with the various parts of the preceding context moves retrodiction “closer” to being a terminal goal. Besides, (1) human-written texts are still finite in length, while a procedural debate can act as an indefinitely long series of breadcrumbs to reconstruct, and (2) there is only a finite body of human-written text to “exercise” on, as opposed to indefinitely many such breadcrumb reversals. Besides, not only may late utterances benefit from being connected with earlier ones, but also the other way around. The latest developments of a deliberative stand-off can play into the evaluation of its opening moves, reinforcing not only immediate effectiveness, but also long-term defensibility across rounds. It is as if an agent initially lacking both long-term memory and the ability to plan would gain access to an infinite playground designed to endow it with the skill of preserving long-term coherence across time. In this environment, the model can freely wander around while trying to reverse engineer its past steps and predict its future ones. These synthetic games might grant us the opportunity to asymptotically convert compute into coherence.
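To gesture at how this might look in practice, the sketch below scores every utterance of a debate so far, regardless of whether it still fits inside the model’s context window. It leans on two assumptions flagged here explicitly: that ArgRank behaves like a PageRank-style centrality over an argument graph, and that a helper such as support_strength, perhaps backed by a natural language inference model, can score how much a later utterance coheres with an earlier one.

```python
import networkx as nx

def rate_full_debate(utterances, support_strength):
    """Rate every utterance produced over rounds 1..n, not just those which
    fit inside the language model's input context.

    utterances is the full list of (party, text) pairs; support_strength is a
    hypothetical scorer of how much a later utterance coheres with an earlier
    one. Edges point from later utterances to the earlier ones they build on."""
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(utterances)))
    for i, earlier in enumerate(utterances):
        for j, later in enumerate(utterances[i + 1:], start=i + 1):
            weight = support_strength(earlier, later)
            if weight > 0:
                graph.add_edge(j, i, weight=weight)
    return nx.pagerank(graph, weight="weight")  # one rating per utterance index
```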

Fig. Forecasting and context length.

Forecasting is easier when one can relate the future to the distant past. Indeed, Winston Churchill famously quipped that "the farther back you look, the further ahead you can see."



Notice how both of these ways of using DebateGPT-like models as plug-in human simulators—the more “tempered” approach of roughly working with the context size, and the more maximalist approach of including signals from outside the context window—attempt to primarily capture the intent of contemporary humans. The amount of “bootstrap” data documenting the values of older generations pales in comparison to the mountains of data currently being collected about our own. There are not hundreds of active online forums for pre-Socratic philosophers to openly share their musings in a persistent format. This temporal bias is also mirrored across space, with Western content trumping most others in volume. Sure, we might specifically prompt language models for dissenting opinions, but as Jacques Derrida might argue—a French philosopher whose prescient insights will soon resurface—we would be bound to conceive of other ideologies from “within” the totalizing structure of our own; we would implicitly objectify the madness which is exterior to our ontological interior, which is interior to our ontological exterior. To instantiate this concern close to home, the reader is again invited to try making a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers.

When combined with the use of the human simulator as a “North Star” to guide the actions of an extremely capable system, this specificity of simulated ideologies faces the concern of value lock-in—the failure mode of establishing our mainstream ways of thought as the status quo for eternity. In other words, a powerful agent would be intentionally imbued with our present values, with seemingly little leeway for “moral progress,” assuming there is some notion of directionality inherent to moral evolution. For better or worse, a number of researchers appear to consider the challenge of reliably inculcating some loosely-human ideology—regardless of it being characteristic of 21st-century San Francisco or 15th-century Rome—as significantly more demanding than tweaking its specifics towards being particularly welcoming of progress.

Let us consider a way of making the human simulator somewhat more adaptive. Instead of attempting to coherently extrapolate the human intent inherent to party \(\pi_{p_1}^H\) above, we could “spin up” an ally party \(\pi_{p_2}^A\) to help prop up the human simulator. The motivation here would be to help patch up the vulnerabilities of \(H\), generally incorporating the same values into a more defensible whole. Using our already familiar notation, we could express this approach as:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(H \cup A \mid E),\]

where \(H\) is the more static original position embodying contemporary human values, \(A\) is the adaptive position meant to provide “reinforcements” for the human one, while \(E\) is the body of evidence being observed by the overall system. Notice also that \(E\) itself, as the background of “the empirical party,” is to be dynamic. Percepts of the world would be constantly emerging and being discarded, as the overall system acts in, and influences, the world over time. Indeed, another issue that is often discussed in alignment is that of distribution shifts, the problem of systems which are being optimized in certain circumstances being tasked with operating in radically different environments. For instance, a future system might acquire unprecedented influence on the world, and so bring it into states which are difficult for us to conceive of, and, more importantly, difficult for us to morally reason about. In attaching a flexible enclosure \(A\) around the kernel of human ideology \(H\), and optimizing it to harden under deliberative critique, we are sketching out an automated way of adapting humanity to an ever-changing world—a realm into which \(E\) would be an empirical window.

So far in this section, we have been discussing simulators. We now move on to the related context of assistance games. This family of alignment proposals involves placing humans and machines in various interactive arrangements. Due to being a relatively general framework, it can also account for approaches we have already mentioned, reframing them through an interaction-centric lens. For instance, vanilla simulators can be seen as involving one “speech act” on the part of humans, that of communicating an entire text corpus as a massive piece of information about human values. Alternatively, the process of fine-tuning models using human feedback can be seen as a more involved interaction pattern, one involving a back-and-forth between model behavior and human feedback. However, while this interaction-centric ontology provides a unifying grammar to describe many other proposals, its prescriptive value comes from suggesting novel arrangements. One instance of this grammar being used generatively can be found in Cooperative Inverse Reinforcement Learning. This “game” similarly involves a regimented interaction between human and AI. However, it specifically involves the AI inquiring about human values in a strategic way, using its “speech acts” to elicit the most relevant information possible at each step.

Such a game format is a quintessentially dialectical setup, not too far removed from the configurations we have been exploring. Note how this game involves human and machine cooperating with each other. On the face of it, this might seem contrary to the eristic nature of our artifacts. We have repeatedly cast truth-seeking in a competitive light. However, the same union operator \(\cup\) allows us to bind human and machine together in a strategic alliance. Accordingly, we can then task the resulting cooperative alliance with pursuing the truth behind human intent. Given, however, the conception of reasonableness articulated in Chapter I and expanded in Chapter III, we approximate truth as that which cannot be coherently undermined, of which defensibility \(\delta\) is the mark. With these considerations in mind, by binding human and machine in a union and tasking the whole with pursuing the most defensible account of human values, we are essentially approaching the same pattern which underlies our previous riff on simulators, namely:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(H \cup A \mid E),\]

where \(H\) denotes the position of the actual human participating in the interaction, \(A\) underlies the infinitely adaptive machine, while \(E\) consists of the body of observations of humans acting in the world, again contextualizing the debate. The core semantic difference with regard to the previous instantiation of this pattern is that the current \(\pi_p^{H}\) denotes an actual, real, non-simulated human—or collective thereof—participating in the live interaction by producing one utterance at a time, which gets interwoven with those of the machine, and those of the “challengers” implied by the defensibility operator \(\delta\). There is no imperfect simulation to speak of here: the ally cooperates with—and the challenger attacks—the authentic human proper.

This concludes our attempt to iterate on related prior work around simulators and assistance games. We now move on to our final attempt to directly build on existing proposals.

Building on Long Reflection

In the previous section, we have explored ways of “elevating” contemporary human values into higher realms of defensibility by propping them up with resourceful systems designed explicitly for that purpose. But again, are our values not partly arbitrary, shaped by circumstances, peers, etc.? Besides, are our values not partly transient? Keeping in mind the end-of-history illusion, it appears overwhelmingly likely that they will get dislodged by other values over the coming decades and centuries. Even the most progressive dictates might end up traditional and dated years into the future.

Given the impermanence and partial baselessness of contemporary human values, we might want to think twice about explicitly incorporating them into the moral judgements which a highly-capable system might then employ in shaping the world. Would it be appropriate to induce this much path dependence on the normative frameworks which such systems might employ, for instance by incorporating a contemporary human simulator \(H\) in the optimization process involving \(\delta(H \cup A \mid E)\)?

One natural approach to doing away with the human component in the previous proposals would be to drop the simulacrum holding position \(H\). We would then end up with a more open-ended search for defensible normative frameworks, the outcome of which would be handed off to the broader system to use as a North Star in guiding its actions. To use the language of bounded defensibility, we would essentially prescribe:

\[\underset{A \in \mathbb{P}}{\text{arg max}} \, \delta(A \mid E),\]

where \(A\) is the flexible conception we are after, while \(E\) incorporates observations of the ever-changing world. But would a most defensible understanding of the world—together with the crucial moral knowledge it ought to incorporate—truly be desirable as a framework to endow our machines with? How much blood has been spilled over the centuries by fanatics blinded by totalizing ideologies which warranted the dismissal of all others? This might prompt us to instinctively return to the seeming reasonableness of the contemporary zeitgeist.

[...] as the heretic is born from the saint and the possessed from the seer. Fear prophets, Adso, and those prepared to die for the truth, for as a rule they make many others die with them, often before them, at times instead of them. [spoiler] did a diabolical thing because he loved his truth so lewdly that he dared anything in order to destroy falsehood. [spoiler] feared the second book of Aristotle because it perhaps really did teach how to distort the face of every truth, so that we would not become slaves of our ghosts. Perhaps the mission of those who love mankind is to make people laugh at the truth, to make truth laugh, because the only truth lies in learning to free ourselves from insane passion for the truth.

Umberto Eco, The Name of the Rose

But consider for a moment the reason why we are now able to look back on past pages of our collective narrative and recognize their darkness in the first place. Indeed, even ideologies which have seemed indefeasible for a brief passage of text—due to being backed by much powerEquivocating political power and motivated reasoning might appear odd. However, the two are closely connected, a relation apparent in the way Erich Fromm interweaves discussion on rationalization and the rise of political regimes in Escape from Freedom. Relatedly, the disparity between social resources made available to various worldviews has also been one of the reasons pushing John Stuart Mill to be a fierce advocate of freedom of speech. In On Liberty, he writes:

“The beliefs which we have most warrant for, have no safeguard to rest on, but a standing invitation to the whole world to prove them unfounded. If the challenge is not accepted, or is accepted and the attempt fails, we are far enough from certainty still; but we have done the best that the existing state of human reason admits of; we have neglected nothing that could give the truth a chance of reaching us: if the lists are kept open, we may hope that if there be a better truth, it will be found when the human mind is capable of receiving it; and in the meantime we may rely on having attained such approach to truth, as is possible in our own day. This is the amount of certainty attainable by a fallible being, and this the sole way of attaining it.”

—have still been defeated in the end. Too late, perhaps, but defeated still. The contemporary zeitgeist overwhelmingly undermines the ideologies which the past few paragraphs might have evoked, providing reasons for hope. But will we not also look back on the present status quo in a few decades, and find it unthinkable to imagine ever considering certain practices acceptable?

However, it would be naive to assume that moral evolution always tends in one direction, with every zeitgeist more defensible than the previous. Indeed, history tends to repeat itself, especially when the memory of past tragedies is not being actively preserved. The cyclical character of human history makes for an almost empirical case against the directedness of moral evolution, at least in a strong monotonic sense. Fortunately, this almost civic insight into metaethics—that history repeats itself and that preserving collective memory is one antidote against it—can be translated into at least two concrete adjustments to our deliberative system pursuing maximal defensibility. First, when researchers attempt to devise generative models which rely on the architecture of generative adversarial networks, rather than on the transformer architecture, they are essentially optimizing two subsystems with opposing goals. On one hand, the “generator” might be tasked with producing photorealistic and natural images. On the other hand, the “discriminator” might be tasked with spotting whether images have been generated (i.e., by the generator subsystem), or whether they are authentic (i.e., actual photographs captured by humans). The generator is incentivized to become more and more capable of “tricking” the discriminator, while the discriminator is incentivized to become more and more capable of seeing through the generator’s trickery. This adversarial arms race—not too far removed from the deliberative one we employed—provides its own autocurriculum, with each subsystem eliciting more and more sophistication from its counterpart.

However, one difficulty which is often encountered in the development of such systems is that of mode collapse. Concretely, this involves the generator and the discriminator playing a cyclical cat-and-mouse game, with the generator systematically “moving away” from regions of, for instance, image space which are “well policed” by the discriminator, only for the discriminator to promptly counter this evasive maneuver with another move of its own. This phenomenon of the discriminator following the generator around in circles typically results in the generator only being able to produce one overly specific type of output at any given time, rather than having a solid grasp of the whole swath of state space implied by the authentic samples. One effective remedy to this failure mode involves unrolling the generator-discriminator game over multiple rounds, providing both subsystems with opportunities to develop strategies which, for a change, are not immediately countered by the opponent.See also Learning with Opponent-Learning Awareness. In this unrolled setting, a generator which simply “runs away from” the discriminator’s oversight is disfavored relative to one which has a decent grasp on the whole state space, due to the avoidant strategy not being effective in the long term. This is in contrast to the previous arrangement, where the avoidant generator could get away with not being penalized for moves which are immediately countered. In our deliberative context, we can translate this solution against cyclicity by unrolling “the debate game” and taking a large number of rounds into account when constructing the argument graph, as sketched below.
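As a rough illustration of this sliding-window adjustment, the sketch below constructs the argument graph over the last several rounds rather than over the latest exchange alone. The pairwise support function is a hypothetical stand-in for the NLI-derived scores used by ArgRank.

```python
# A sketch of the "large sliding window" idea: build the argument graph over
# utterances from the last `window` rounds, so that evasive strategies remain
# exposed to older attacks. `support` is a hypothetical stand-in for the
# pairwise inter-utterance scoring used by ArgRank (e.g., an NLI-derived value).
import itertools
import networkx as nx

def windowed_argument_graph(rounds, support, window=10):
    """rounds: list of rounds, each a list of utterance strings."""
    recent = list(itertools.chain.from_iterable(rounds[-window:]))
    graph = nx.DiGraph()
    graph.add_nodes_from(recent)
    for u, v in itertools.permutations(recent, 2):
        graph.add_edge(u, v, weight=support(u, v))
    return graph
```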

Besides the trick of preserving more of the game’s past in memory through the pattern of a large sliding window—with many rounds of debate or generator-discriminator stand-offs being taken into account at once—we could also attempt to preserve more of the players themselves. When developing AlphaStar, a system capable of playing StarCraft II at a level “above 99.8% of officially ranked human players,” DeepMind researchers did not merely make use of a single model gaining experience by means of endlessly playing against itself. Rather, the authors implemented a league of models of various levels of sophistication, and then pressured the latest models to play against the most demanding mixture of past models. This approach appears to have been essential for preventing the “elite” players—in their local high-echelons of competition—from forgetting how to outplay the more rudimentary players in the league. In essence, besides providing a demanding autocurriculum, the league as a whole helps preserve the memory of vulnerabilities faced by parties of the past, reminding the present players to steer clear of them through selective pressure, and so again reducing cyclicity. While the optimization process behind DebateGPT involved no analogous repository of simulators, future ones might, as attempts to bake in systemic guardrails against repeatedly succumbing to the same failure modes.
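A minimal sketch of league-style opponent selection, loosely modeled on the setup just described, might look as follows. The checkpoint objects and the bookkeeping are illustrative; nothing of the sort was part of the actual DebateGPT training run.

```python
# League-style opponent selection (illustrative): past checkpoints are kept
# around, and the current model is preferentially matched against the
# opponents it still struggles to defeat, so that old vulnerabilities are
# not forgotten.
import random
from collections import defaultdict

class League:
    def __init__(self):
        self.checkpoints = []            # past versions of the model
        self.losses = defaultdict(int)   # current model's losses per opponent

    def add(self, checkpoint):
        self.checkpoints.append(checkpoint)

    def sample_opponent(self):
        # Weight opponents by how often they have beaten the current model.
        weights = [1 + self.losses[id(c)] for c in self.checkpoints]
        return random.choices(self.checkpoints, weights=weights, k=1)[0]

    def record(self, opponent, current_won: bool):
        if not current_won:
            self.losses[id(opponent)] += 1
```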

Notice also how the evaluation of a debate game implied by ArgRank is currently external to the competing parties. Regardless of the positions held by the simulacra, it is the argument graph—based on the distinct natural language inference models gauging inter-utterance coherence—which paints a picture of the power dynamics involved. However, the situation would change if ArgRank were to undergo the superhuman developments we have speculated about, and this is where Derrida’s insights resurface. If the model itself is to participate in its own evaluation, it might end up projecting its own ontology onto the process, and so become forced to express any exteriority in its own terms, judging it from within, rather than from a detached position. Notice, however, that the static substrate composed of natural language inference models is not really much better off—instead of judging from within a potentially superhuman interiority, those pretrained models judge from within the interiority of the contemporary human zeitgeist. Devising ways of ensuring that the interiority employed in the speculative version of ArgRank is constantly expanding, rather than contracting into a claustrophobic rigidity, appears to be yet another challenging issue at the interface of engineering and philosophy. Tentatively, would a repository of past zeitgeists help preserve the memory of past interiorities, promoting spaciousness by merging them into a disjunctive space?Interestingly, Derrida’s own ontology is intimately compatible with the competitive debate underlying our three artifacts, making regular use of vivid terms such as force, violence, totalitarianism, oppression, etc. to describe the authoritative role of ontologies in structuring thought. Would such an adjustment be too forceful a meta-level inductive bias on our part?

The Greek miracle is not this or that, such and such astonishing success; it is the impossibility for any thought ever to treat its sages as "sages of the outside," [...] in welcoming alterity in general into the heart of the logos, the Greek thought of Being forever has protected itself against every absolutely surprising convocation.

Jacques Derrida, Violence and Metaphysics

But now science, stimulated by its powerful illusion, hastens irresistibly to its limits, on which its optimism, hidden in the essence of logic, is wrecked. For the periphery of the circle of science has an infinite number of points, and while there is still no telling how this circle can ever be completely measured, yet the noble and gifted man, even before the middle of his career, inevitably comes in contact with those extreme points of the periphery where he stares into the unfathomable. When to his dismay he here sees how logic coils round itself at these limits and finally bites its own tail—then the new form of perception rises to view, namely tragic perception, which, in order even to be endured, requires art as protection and remedy.

Friedrich Nietzsche, The Birth of Tragedy

Speaking of baking inductive bias into a situation where we may depart from contemporary human values, what of the very valuing of humanity? To touch on the thorniest dilemma of this section, would it be moral to allow “love for mankind” to conflict with moral progress in the case in which a most defensible normative position implies, for the sake of argument, the danger posed by humanity to other moral patients across the lightcone? What ought we place in higher regard—and implement through concrete engineering choices—when cornered into such thought experiments: humanism or moral progressivism?

Connections to Logical Inductors & Classical Debate

Over the previous few sections, we have attempted to directly iterate on approaches to the alignment problem. Here, we will instead highlight intriguing connections to a couple of other approaches. The first paradigm which we relate our varied artifacts to is that of logical induction. Developed as a model of ideal reasoning under uncertainty assumed to resemble highly-capable future systems, logical induction describes the iterative process of fuzzy estimates converging on truth-values. To get a sense of such a process, consider being asked to assign a fuzzy truth-value to the following proposition:

\[P=\text{The hundredth digit of }\pi\text{ is }7.\]

If someone only has a few seconds to provide an answer, they might quickly go with \(10\%\) as a best guess, due to \(7\) being one of the \(10\) possible digits which get mingled irrationally. If, however, one is instead given an hour and a piece of paper, the fuzzy estimate might become quite different. For instance, one might carry out the manual computation which points towards \(7\) being the actual hundredth digit. But it is also possible that the person has made a mistake in the long chain of calculations, so they might only assign an estimate of \(90\%\) to \(P\) being true at the moment. Following a few subsequent repetitions of the computation, just to be sure, their best guess might further climb to \(99\%\). It is still not \(100\%\), as there is a possibility that they might have misremembered the algorithm for computing \(\pi\) digits, or perhaps have made a systematic mistake across all separate replications. The fuzzy estimate is inherently dynamic, with the best guess at each point in time being different.

Notice, however, that the estimates are advanced by an individual with certain beliefs. If, for instance, the participant believes that they have a long history of making sloppy mistakes when carrying out computations by hand, they might have less confidence in \(P\) being true even after redoing the calculations ten times over, perhaps only approaching \(80\%\). Conversely, if the participant believes themself a polymath, their estimates might be relatively high throughout. As another consideration, if the participant observes themself steadily approaching \(100\%\), they might use that bit of meta-cognitive introspection to estimate something close to certainty in advance. That said, even this inference would rely on their beliefs about the monotonicity of similar reasoning processes, being again haunted by the twin spectres of overconfidence and underconfidence.

While coming up with pertinent fuzzy estimates is the problem, the same team of researchers also propose an algorithm as a related solution, called Garrabrant inductors or logical inductors:

[...] the formalization of the algorithm is basically finance. You just make a stock market of traders which are betting on sentences, then you imagine that market, and then whatever the market believes, you believe that. [...] Basically, there is some definition of traders [...] and it says that you are good at logical induction if any trader who's not willing to [...] risk losing more than a bounded amount is not going to be able to make infinite money from you. So if you walk up to a Garrabrant inductor and you promise yourself you're never going to risk [...] going negative a million dollars in debt [...] you're not going to make a million dollars betting against it. [...] from that one definition you get all those amazing properties. [...] pretty cool I think.

Andrew Critch, Logical Inductors at EAG 2016

In essence, the proposed algorithm for implementing “good” reasoning under uncertainty (i.e., reasoning which satisfies a number of nice theoretical properties) relies on a market of traders which are systematically incentivized to avoid being financially exploited when betting on fuzzy estimates about the truth of propositions. In this iterative rat race driven by make-believe money, the “voice of the prediction market” provably converges on solid results, in the limit. Besides this key property, these theoretical constructs appear to have many other beautiful ones, such as:

Logical inductors learn to recognize any pattern in theorems (or contradictions) that can be identified in polynomial time. Consider a sequence of conjectures generated by a brilliant mathematician, such as Ramanujan, that are difficult to prove but keep turning out to be true. A logical inductor will recognize this pattern and start assigning Ramanujan’s conjectures high probabilities well before it has enough resources to verify them. [...] Logical inductors have accurate beliefs about their own beliefs, in a manner that avoids the standard paradoxes of self-reference. For instance, the probabilities on a sequence that says "I have probability less than 50% on the nth day" go extremely close to 50% and oscillate pseudorandomly, such that there is no polynomial-time method to tell whether the nth one is slightly above or slightly below 50%. [...] Logical inductors learn to trust their future beliefs more than their current beliefs. This gives some formal backing to the intuition that real-world probabilistic agents can often be reasonably confident in their future reasoning in practice, even though Gödel's incompleteness theorems place strong limits on reflective reasoning in full generality.

Nate Soares, New paper: "Logical induction"
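As a loose, toy-sized illustration of this incentive structure, and emphatically not the actual Garrabrant construction, consider a wealth-weighted market in which traders who bet badly lose influence over the aggregate belief:

```python
# Toy illustration only: traders bet probabilities on a proposition, the
# "market belief" is a wealth-weighted average, and traders that bet badly
# lose wealth (and hence influence) once the truth is revealed.
def market_belief(probs, wealth):
    total = sum(wealth)
    return sum(p * w for p, w in zip(probs, wealth)) / total

def settle(probs, wealth, outcome: bool):
    # Pay each trader in proportion to the probability assigned to the realized
    # outcome, scaled so that a bet of 0.5 breaks even.
    return [w * 2 * (p if outcome else 1 - p) for p, w in zip(probs, wealth)]

wealth = [1.0, 1.0, 1.0]
probs = [0.9, 0.5, 0.1]               # three traders betting on the same sentence
print(market_belief(probs, wealth))   # 0.5 before any settlement
wealth = settle(probs, wealth, outcome=True)
print(market_belief(probs, wealth))   # shifts toward the trader who bet 0.9
```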

While logical inductors as truth-seeking engines are nothing short of beautiful in their elegance, they remain incredibly computationally demanding, making them virtually impossible to meaningfully implement in real-world applications. That said, consider properties shared by both the traders which underlie Garrabrant inductors and the parties which underlie the deliberative arms race assembled from ArgRank, DebateGPT, and bounded defensibility. Namely, both traders and parties are incentivized to avoid being exploited by other traders and parties, respectively. In logical induction, exploitation reads as losing money wagered on “truth” to a trader implementing a “more truthful” strategy, where truthfulness is operationalized as that which protects against debt—in a circularity hinting at the pragmatic conception of reasonableness being implied. In bounded defensibility, exploitation reads as being defeated by a party relying on a “more truthful” position and strategy, where truthfulness is operationalized as that which cannot be coherently defeated—again a circularity which merely reflects our conception of reasonableness. Furthermore, the emergent dynamics of both systems are argued to lead to increased truthfulness over the course of a prolonged competition between traders and parties, respectively. In the limit, logical inductors provably converge on virtually unexploitable(-in-polytime) trading strategies. Over the epochs, DebateGPT has been argued to yield simulacra whose defense abilities grow more and more sophisticated.

Besides those high-level similarities, the two truth-seeking engines could not be more different. Garrabrant inductors exhibit proven theoretical properties which DebateGPT can only dream of, while bounded defensibility arguably positions itself somewhat more advantageously relative to contemporary prosaic systems. However, the two can cross-pollinate in intriguing ways. For instance, the elegant introspective abilities of Garrabrant inductors hint at ways in which parties competing with each other within the confines of DebateGPT could potentially reason about their very reasoning process, the very architecture of the deliberative arms race they are engaging in. Alternatively, the beautiful self-trust properties which allow Garrabrant inductors to “trust their future beliefs more than their current beliefs” hint at ways in which simulacra might condition themselves to cohere with simulacra of future epochs, potentially through explicit calibration.This train of thought is also reminiscent of iterated distillation: condition a system to reach the conclusions it has previously reached over the course of a longer deliberation during a shorter, more limited deliberation. Then, have the more efficient version again deliberate for a longer period, before using the outcome to condition for faster results. Rinse and repeat. Nowhere is the cross-pollination more obvious, however, than with applications, hinting at ways of employing Garrabrant inductors which are analogous to the ones discussed above. Any attempt to transport more of the theoretical work towards our framework, however, will require a much more rigorous treatment of bounded defensibility as a theoretical foundation, potentially even involving a move from frequentism to Bayesianism, with developments of the debate game gradually informing estimates of a position’s defensibility.

The other paradigm which we attempt to connect to is what we will presently call classical debate. This term is something of a misnomer, because the paradigm we are trying to relate our artifacts to is extremely recent. However, we have avoided introducing it earlier in an attempt to make it easier for us to explore a subtly different ontology and framing of the problem. In an influential paper titled AI Safety via Debate, Irving et al., as part of OpenAI’s Reflection team, describe two systems engaged in a debate which is judged by a human. When cast in the light of a full-blown alignment proposal, classical debate describes a process in which two superhuman debaters are adversarially incentivized to deconstruct complex dilemmas into cruxes which are within the human’s ability to judge, potentially granting both parties interpretability tools to help “expose” a deceptive opponent in front of the human judge. However, we presently do not employ a human judge, and instead define reasonableness through the epistemologically-principled ArgRank. Additionally, we are not necessarily bothered by deceptive parties, as we rely more on the relative ease of defending certain positions, even deceptively, if need beGiven that the same system is simulating competing perspectives, it would be surprising if some were not deceptive relative to the model’s internal epistemics.—though we speculated on extending ArgRank to account for coherence with the model’s internals. There are various other subtle distinctions which make the two approaches feel “slightly off” relative to each other, despite almost attempting to formalize the same processes.

Knowing must therefore be accompanied by an equal capacity to forget knowing. Non-knowing is not a form of ignorance but a difficult transcendence of knowledge. This is the price that must be paid for an oeuvre to be, at all times, a sort of pure beginning, which makes its creation an exercise in freedom.

Jean Lescure, Charles Lapicque

As a rapid-fire listing of slight discrepancies between the present work and classical debate, consider that: each party in the former primarily has their own position, while parties in the latter are cast as having more of a personalized distribution over the same beliefs (similar to the “investment portfolios” within Garrabrant inductors); relatedly, the formalism in the former attempts to accommodate beliefs-as-ends as a first-class application, while the formalism in the latter focuses more on a finite proponent-opponent stand-off; hosting multiple parties is natural in the former, while it is unclear how a human judge in the latter ought to decide on one winner among many; one party’s standing is primarily continuous in the former (for reward shaping reasons), while being cast as more discrete in the latter; deconstructing decisions into cruxes is not much of a focus in the former, as a human-level judge is not really part of the scheme at all, not even empirically approximated through a reward model; we are working in the former with party simulacra “internal” to one model, while distinct (albeit cloned) systems are present in the latter, etc. Subtle distinctions aside, it is obvious that both efforts are motivated by related goals and run into related issues, for instance regarding a convergence on non-monotonic logic:

Having learned that there's no way to mechanize even heuristic explanations for all the true statements of arithmetic, we could set our sights lower still, and ask about mere plausibility arguments—arguments that might be overturned on further reflection. Is there some sense in which every true mathematical statement at least has a good plausibility argument?

Scott Aaronson, Oh right, quantum computing

Fortunately, the two approaches can cross-pollinate. For instance, classical debate appears much closer to work on logical induction in terms of the type of formalisms involved, potentially providing a pathway for connecting the more applied work afforded by concrete training regimes with the esoteric realm of idealized reasoning. Conversely, the present work could provide a pathway to better connect classical debate with the type of systems likely to be developed in the near future. Alternatively, the ArgRank operationalization might help address some of the challenges otherwise faced by the human judge, although it might also introduce others.

This concludes our exploration of conceptual bridges, and with that, our broader discussion around ways in which one might apply our artifacts to address the alignment problem.

Ch. V, Benchmarking Artifacts

Benchmarking ArgRank’s Dependencies

Over the previous chapters, we attempted to operationalize the process of truth-seeking. To recap, we argued that the nature of truth-seeking lies in the search for parties which can coherently challenge one’s claims. Operationalizing truth-seeking then requires, among other things, operationalizing what it means for a party to coherently challenge another’s claims. This led us to the following decomposition: coherently challenging a position is equivalent to winning a debate against a party holding it, winning a debate is equivalent to having the strongest arguments, the strength of an argument is proportional to the extent to which it is supported by other strong arguments, and the amount of support lent by one argument to another can be gauged empirically or rationally. Taking stock of the entire decomposition, gauging support between arguments appears to be the most load-bearing element. Therefore, in order to gauge the defensibility of this decomposition, we start by investigating the effectiveness of methods used to gauge such support.

As described in Chapter I, we have tentatively opted to use pretrained natural language inference models to help gauge the extent to which one argument supports another. We start by benchmarking a family of such models on the problem of detecting relations of entailment or lack thereof in cases where the existence of such a relation is assumed. In other words, as a “sanity check” for models broadly optimized to detect relations of support between propositions, we attempt to separate valid and fallacious inferences as defined more narrowly by classical logic. The data points which comprise the benchmark can therefore be split into: pairs of premises and hypotheses which follow one of the rules of inference established by classical propositional logic, and pairs of premises and hypotheses which are known not to follow such rules, although they superficially seem to.

One pair of superficially similar inference patterns can be found in the duo of modus tollens and denying the antecedent. Modus tollens refers to the pattern \(P \rightarrow Q, \neg Q \models \neg P.\) As an example, take: “If the dog detects an intruder, the dog barks. The dog does not bark. Therefore, the dog does not detect an intruder.” In contrast, denying the antecedent refers to the fallacious pattern of inferring \(\neg Q\) from \(P \rightarrow Q\) and \(\neg P\). As an example, take: “If the dog detects an intruder, the dog barks. The dog does not detect an intruder. Therefore, the dog does not bark.” The first is valid, the second is not. In the case of the second example, the dog might bark for some other reason entirely, and so excluding one potential cause of barking is not enough to prove the absence of barking. We take “the logician’s” assignments of validity as ground-truth labels, and use them to denote two classes: support and lack thereof.
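The “logician’s” assignments can themselves be made mechanical: a brute-force check over all truth assignments confirms that modus tollens is valid while denying the antecedent is not, which is exactly the ground truth adopted below.

```python
# Brute-force validity check over all truth assignments for P and Q.
from itertools import product

def implies(p: bool, q: bool) -> bool:
    return (not p) or q

# Valid iff the conclusion holds in every assignment satisfying the premises.
modus_tollens = all(
    not p for p, q in product([True, False], repeat=2) if implies(p, q) and not q
)
denying_antecedent = all(
    not q for p, q in product([True, False], repeat=2) if implies(p, q) and not p
)
print(modus_tollens, denying_antecedent)  # True, False
```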

To test whether models optimized for natural language inference succeed in separating data points into these two classes, we have generated a hundred instances of modus tollens and a hundred instances of denying the antecedent in a semi-automatic way: we employed an autoregressive language model to help us expand a list of such instances, yet manually ensured that each data point conforms to its designated pattern. As natural language inference models are pretrained to operate with premise-hypothesis-label triples, rather than with an arbitrary number of premise strings, we concatenate the two distinct premises into one unified premise string for each data point. Additionally, we employ a pipeline identical to the one used in ArgRank by deriving a floating-point value from the model’s logits. The table below helps provide a better sense of how this dataset has been constructed.

Table. Sample data points.

Each data point consists of a premise string \(X_0\), a hypothesis string \(X_1\), and a ground-truth label \(Y\). The model is then employed to assign a value \(\hat{Y}\).

\(X_0\) \(X_1\) \(Y\) \(\hat{Y}\)
If the dog detects an intruder, the dog barks. The dog does not bark. The dog does not detect an intruder. Valid 0.82
If the dog detects an intruder, the dog barks. The dog does not detect an intruder. The dog does not bark. Invalid 0.57

Ideally, such pretrained models would assign higher estimates of premise-hypothesis support to data points previously labeled valid than to data points previously labeled invalid. In other words, such models would ideally be able to cleanly separate the two classes of data points across the \([0, 1]\) interval populated by predicted values, such that there exists a threshold value above which all valid data points and only these can be found, and below which all invalid data points and only these can be found. In practice, performance on this classification problem is imperfect, resulting in some data points labeled as invalid being predicted as exhibiting stronger inter-statement support than some data points labeled as valid.
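To make the pipeline sketched in the table more concrete, the snippet below shows one way of deriving such a support value from a pretrained natural language inference checkpoint. The model name and the label-to-index mapping are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of deriving a scalar support estimate from an NLI model's logits.
# The checkpoint name is illustrative; which logit index corresponds to
# "entailment" must be checked against the chosen checkpoint's config.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cross-encoder/nli-deberta-v3-small"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def support_estimate(premises, hypothesis, entail_idx=1):
    """Concatenate premises into one string and return P(entailment) in [0, 1]."""
    premise = " ".join(premises)
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)[entail_idx].item()

# One valid (modus tollens) and one invalid (denying the antecedent) data point.
valid = support_estimate(
    ["If the dog detects an intruder, the dog barks.", "The dog does not bark."],
    "The dog does not detect an intruder.",
)
invalid = support_estimate(
    ["If the dog detects an intruder, the dog barks.",
     "The dog does not detect an intruder."],
    "The dog does not bark.",
)
print(valid, invalid)
```

Feeding the resulting estimates, together with the ground-truth labels, into a standard routine such as scikit-learn’s roc_auc_score then yields the ROC areas under curve reported below.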

In order to get a sense of how effective natural language inference models are at “recovering geometrical inference,” we employed them as classifiers on the binary classification task described above. However, instead of benchmarking one single such model, we benchmarked an entire family of such models using the same procedure. The members of this set of models have each been pretrained in the same way. However, what sets them apart is their model size, ranging from \(22\) to \(304\) million parameters. We were particularly interested in the way benchmark performance varies as a function of model size; we wanted to get a better sense of whether naively scaling up natural language inference models may make ArgRank stronger. In line with this envisioned possibility, we hypothesized that model size would be positively correlated with benchmark performance.

Interestingly, our findings have been far from our initial expectations. We observed a steady decline in benchmark performance as we employed larger models. The smallest model ended up being most effective at separating data points into the classes associated with their ground-truth labels—with an ROC area under curve of \(\sim0.7\), where \(\sim0.5\) would correspond to an entirely random classifier, while \(1.0\) would correspond to an ideal classifier. In contrast, the largest model turned out to be least effective at the binary classification task—with an ROC area under curve of \(\sim0.48\), that is, essentially indistinguishable from chance. When we then turned to the Inverse Scaling Prize, a competition for identifying tasks on which models exhibit such inverse scaling, we were surprised to find a task closely related to the one we were studying: classifying instances of modus tollens using autoregressive language models. While there are subtle distinctions between the two tasks, we found it intriguing to relate the two sets of findings and reflect on why it is that models exhibit inverse scaling behavior on such tasks (in a certain range, at least).

Fig. Exploded view of a phone.

To this end, we further “exploded” all data points into their constituent propositional atoms (e.g., “the dog detects an intruder”), and recombined them into all possible arrangements which adhered to either modus tollens or denying the antecedent. Interestingly, this “procedurally expanded” dataset resulted in close-to-chance performance across all model sizes on the same binary classification task. We have also observed close-to-chance performance when recombining propositional atoms into wholly different patterns (e.g., modus ponens as an additional valid class, affirming the consequent as an additional invalid class). This strongly suggests that the models are not able to pick up on logical validity or the lack thereof in a principled way. Had the models achieved better than chance results on the initial data points by recognizing these patterns, they would have continued to perform better than chance in the rearranged cases. Some other feature common to the initial data points but not shared with the expanded dataset must explain how the models outperformed chance on the first trial. To understand what this other feature might be, consider the example of modus tollens given above, together with further examples generated by “exploding” and recombining atoms:

Table. Sample modus tollens recombinations.

The propositional atoms which comprise the first data point are “exploded” and recombined; the original data point plus three recombinations conforming to the same inference pattern are shown.

\(X_0\) \(X_1\) \(Y\)
If the dog detects an intruder, the dog barks. The dog does not bark. The dog does not detect an intruder. Valid
If the dog detects an intruder, the dog does not bark. The dog barks. The dog does not detect an intruder. Valid
If the dog does not detect an intruder, the dog barks. The dog does not bark. The dog detects an intruder. Valid
If the dog does not detect an intruder, the dog does not bark. The dog barks. The dog detects an intruder. Valid

All four examples have the structure of a modus tollens argument: \(P \rightarrow Q, \neg Q \models \neg P.\) Recall that \(P\) and \(Q\) are variables that could also stand for a sentence that includes a negation, and that double negation is rewritten as no negation at all. However, only the first and fourth examples “make sense” by relating barking with an intruder or no barking with no intruder. The second and third examples do not make intuitive sense because they violate expectations gained from prior exposure to dogs, intruders, the social institution of guard dogs, etc. Importantly, the same expectations could also be gained from prior exposure to language about dogs, intruders, etc. If the language models are tracking meaning grounded in this type of exposure, we should expect a high score for the first and fourth examples, and a low score for the second and third, thereby performing at chance on the true task of identifying logically valid inferences. The same pattern holds for denying the antecedent. Consider again the initial example, as well as the recombined versions:

Table. Sample denying the antecedent recombinations.

The previous data point can also be recombined into the four other data points which conform to a different inference pattern than the original.

\(X_0\) \(X_1\) \(Y\)
If the dog detects an intruder, the dog barks. The dog does not detect an intruder. The dog does not bark. Invalid
If the dog detects an intruder, the dog does not bark. The dog does not detect an intruder. The dog barks. Invalid
If the dog does not detect an intruder, the dog barks. The dog detects an intruder. The dog does not bark. Invalid
If the dog does not detect an intruder, the dog does not bark. The dog detects an intruder. The dog barks. Invalid

Again we see that the first and fourth examples make intuitive sense, despite being logically invalid, while the second and third examples are both logically invalid and semantically off relative to common prior experience or language exposure. A model tracking “common sense” might again perform close to chance on the true task. Since this is what we observed, it seems likely that the models are tracking intuitive sense and not logical form.
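For completeness, the “explode and recombine” procedure itself is straightforward to mechanize. The sketch below generates every modus tollens and denying-the-antecedent variant from hand-written surface forms of two propositional atoms, leaving all linguistic subtlety aside.

```python
# Generate all recombinations of two atoms into modus tollens (valid) and
# denying-the-antecedent (invalid) data points. Negation is handled by simple
# lookup between hand-written affirmative and negated surface forms.
from itertools import product

detects = ("the dog detects an intruder", "the dog does not detect an intruder")
barks = ("the dog barks", "the dog does not bark")

def negate(form, atom):
    affirmative, negated = atom
    return negated if form == affirmative else affirmative

def recombine(antecedents, consequents):
    data = []
    for a, c in product(antecedents, consequents):
        conditional = f"If {a}, {c}."
        # Modus tollens: {A -> C, not C}, conclude not A (labeled valid).
        data.append((f"{conditional} {negate(c, consequents).capitalize()}.",
                     f"{negate(a, antecedents).capitalize()}.", "Valid"))
        # Denying the antecedent: {A -> C, not A}, conclude not C (labeled invalid).
        data.append((f"{conditional} {negate(a, antecedents).capitalize()}.",
                     f"{negate(c, consequents).capitalize()}.", "Invalid"))
    return data

for premise, hypothesis, label in recombine(detects, barks):
    print(label, "|", premise, "->", hypothesis)
```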

Further evidence for this was obtained by running the experiment again on the original examples, this time with the conditional part of the premise omitted. This amounts to checking the models’ commitment to the contrapositive of the conditional (in the case of modus tollens, for which only \(\neg Q \models \neg P\) remains after deleting \(P \rightarrow Q\)) or to the inverse of the conditional (in the case of denying the antecedent, for which only \(\neg P \models \neg Q\) remains after deleting \(P \rightarrow Q\)). Omitting the conditional had a negative impact at lower sizes, while having a positive impact at larger sizes. For modus tollens at least, on the assumption that the models are tracking intuitive sense, this is not surprising. Since a conditional and its contrapositive are logically equivalent, we should expect the contrapositive to make as much sense intuitively as the original conditional. It appears that at small sizes, with limited prior exposure to language, the models may benefit from having the original conditional included explicitly. At large sizes, however, with greater exposure to language, the model may have already encoded the relevant semantic connections and may be increasingly distracted by the inclusion of the original conditional. That is, the large models might have already internalized the fact that the dog not barking implies that it did not notice an intruder.

This has weaker explanatory power for denying the antecedent, because here we expect a lower estimate of support (signifying no valid inference) for the inverse of the original conditional. We expect the low ranking because a conditional and its inverse are not logically equivalent—this is why denying the antecedent is a fallacy. The trouble is, there is no guarantee that the inverse of a semantically sensible conditional will lack intuitive sense. It could go either way, as in these two examples, drawn from the original data:

Table. Sample conditionals and inverses.

Whether or not a conditional is as sensible as its inverse varies on a case by case basis.

\(P \rightarrow Q\) \(\neg P \rightarrow \neg Q\)
If the cat is not purring, it is not happy. If the cat is purring, it is happy.
If I am drinking coffee, I am awake. If I am not drinking coffee, I am not awake.

In the first example, the inverse seems somewhat more sensible, while in the second the inverse seems less sensible. Nevertheless, it is not implausible that a larger model, with more effective exposure to language, would do better spotting problems with just the inverse without the distraction of the original conditional, while the smaller model would still benefit from extra context supplied by that conditional.

These findings raise doubts about the possibility of improving ArgRank by naively increasing the size of the natural language inference models employed as building blocks. But how exactly might ArgRank be improved? Ideally, we want subsystems gauging inter-argument support to “smile” on logically valid forms and “frown” on fallacies. But we also want them to track subtle semantic connections. Would it be possible to preserve both empirical approximations of “common sense” and the apt recognition of logical forms? One might imagine designing a building block of ArgRank which has the ability to occasionally defer to a more predictable subsystem which only tests for logical validity, before then using the result to inform the final estimates of inter-statement support.
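One hedged sketch of such a hybrid building block is given below. Both match_logical_form and nli_support are hypothetical stand-ins, and the blending scheme is only one of many conceivable ways of letting a symbolic validity check inform the learned estimate.

```python
# Hybrid support estimate: defer to a symbolic validity check when a known
# propositional form is detected, otherwise fall back on the learned NLI score.
# `match_logical_form` and `nli_support` are hypothetical stand-ins.
def hybrid_support(premise, hypothesis, match_logical_form, nli_support,
                   blend=0.5):
    """Return an inter-statement support estimate in [0, 1]."""
    form = match_logical_form(premise, hypothesis)  # e.g. "modus_tollens" or None
    learned = nli_support(premise, hypothesis)
    if form is None:
        return learned
    symbolic = 1.0 if form in {"modus_ponens", "modus_tollens"} else 0.0
    # Let the predictable symbolic verdict pull the learned estimate toward it.
    return blend * symbolic + (1 - blend) * learned
```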

However, this might not be necessary. Recent research into scaling laws has uncovered the reversal of inverse scaling laws at the frontier of workable model sizes. In other words, while model performance locally appears negatively correlated with size, the broader trend is characterized by a U-shaped curve, with models starting to recover performance beyond a certain scale. Wei et al. speculate on “distractor tasks” as a potential explanation which coheres with observed model behavior. They argue that at a certain scale, models become capable of performing well on a related distractor task, but this overwhelms them and detracts from their performance on the “true task.” At larger scales, however, models become capable of “ignoring” the distractor task and instead execute on the true task. This explanation coheres with observations made while teaching introductory logic: students tend to be more suspicious of valid arguments from false premises to absurd conclusions than they are of fallacious arguments from true premises to reasonable conclusions. Reflecting on what the statements actually mean—rather than simply studying the relations of, for example, negation in a more formal way—often hampers performance on the “true task.”

To conclude, we have investigated whether or not natural language inference models do in fact employ logical form in their assignments of inter-statement support. Our findings strongly suggest that this is not actually the case, hinting at the poor defensibility of our initial hypothesis. However, recent work hints at the relation between scale and performance being different in more exotic regimes of scale. Beyond that, however, formulating what it means for a statement to support another in the first place remains a crux of our broader decomposition of coherent challenging. While future approaches to natural language inference may prove competitive with existing ones on established benchmarks, the proper framing of inter-statement support remains, for better or worse, up for debate. Perhaps the debate “subroutines” of Chapter II—the idea of recursively carrying out debates to help gauge inter-statement support—might also be worth pursuing further.

Benchmarking ArgRank

In the previous section, we investigated the suitability of natural language inference models as building blocks of ArgRank. To complement these experiments, we now investigate ArgRank holistically, as an end-to-end debate evaluation pipeline. In order to get a sense of how this broader system performs at the task of gauging the standing of parties involved in various debates, we set out to compare its final outputs against human verdicts, and verdicts predicted by formal models of computational argumentation.

Before discussing the process of obtaining alternate verdicts to compare those of ArgRank with, we first describe how we obtained raw debate transcripts. To begin with, we had two main desiderata for a dataset of debates to benchmark evaluation pipelines against: there should be a text version available, and the verdict of a human judge should also be available. Unfortunately, finding a dataset of debates which ticks both of these boxes proved surprisingly difficult. We found a large number of debates which are only available as recorded video. While these occasionally have an associated “ground-truth” signal in the form of a judge’s verdict, transcribing or post-processing automated transcripts proved beyond the resources we had at our disposal. Additionally, we also found a large number of broadcast political debates, the transcripts of which are often readily available. Unfortunately, those debates rarely have official human verdicts attached, due to their politically-charged settings. Moreover, we also found a large number of debates on online platforms dedicated to debate, with text versions readily available—or easily scrapable. Unfortunately, those platforms typically lack official verdicts, and are rather intended for enabling open-ended dialogue.

It is for these reasons that we eventually decided to create the debates ourselves. The actual content of the transcripts was generated in a semi-automatic way using the then state-of-the-art autoregressive language model available to us. Additionally, we attempted to prompt this convenience model to produce two-party debates in which one party blatantly contradicts themselves, so as to streamline later evaluation. However, we found this exceedingly hard to do, as the model steadily avoided self-contradiction, perhaps due to its drive for autoregressive coherence. This highlights the potential of state-of-the-art models to help bootstrap reasoning faculties in a synthetic, self-play regime.

After generating raw debate transcripts, we assigned human verdicts as ground-truth labels for each of the data points. To this end, all the team members not involved in the semi-automatic generation process received a table to fill in with their own assessments, leading to three individual verdicts per debate transcript. We then aggregated these using majority voting. No explicit formal guidelines were given for the human verdicts (e.g., no instructions about how to determine whether an argument is acceptable based on computational argumentation). The aim of this standard practice was to encourage the evaluators to examine their innate intuitions, rather than contaminating them with the normative models’ rules. According to methodological descriptivism, “a theory can and should be tested by comparing what it has to say about the validity of the arguments it covers with the intuitive judgments of those who use the language concerned.”

Besides comparing ArgRank’s outputs to human verdicts given the same debate inputs, we also investigated how ArgRank performance relates to that of other methods deployed in computational argumentation. It is worth noting that computational argumentation’s main focus is on the relations between arguments. In general, argumentation consists of two major branches: abstract argumentation theory, introduced by Dung and described in Chapter I, where “one models arguments by abstracting away from their internal structure to focus on the relations of conflict between them,” and structured argumentation theory, where “one additionally models the internal structure of arguments through a formal language in which arguments and counterarguments are constructed.”

An abstract argumentation framework is a pair \(\langle A, C\rangle\), where \(A\) is a set of arguments and \(C \subseteq A \times A\) is a binary relation of attack. The labeling approach characterizes the various semantics in terms of labelings of \(A\). A labeling of an abstract argumentation framework \(\langle A, C\rangle\) is any assignment of either the label in or out (but not both) to zero or more arguments from \(A\) such that an argument is in if and only if all arguments attacking it are out, and an argument is out if and only if it is attacked by an argument that is in. In this context, stable semantics labels all arguments, while grounded semantics minimizes the set of arguments that are labeled in, and preferred semantics maximizes them. Relative to given semantics, an argument is skeptically acceptable if it is labeled in in all labelings, it is rejected if it is labeled out in all labelings, and it is credulously acceptable if it is labeled in in some but not all labelings. Moreover, various types of extensions are determined using Dung’s abstract argumentation system.
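As a minimal sketch of these labeling conditions, the following computes the grounded labeling of a small framework by repeatedly adding only those labels which the conditions force, leaving everything else undecided:

```python
# Grounded labeling of an abstract argumentation framework <A, C>: repeatedly
# label an argument "in" when all of its attackers are "out", and "out" when
# some attacker is "in"; arguments never forced either way remain undecided.
def grounded_labeling(arguments, attacks):
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    label = {}
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in label:
                continue
            if all(label.get(b) == "out" for b in attackers[a]):
                label[a] = "in"
                changed = True
            elif any(label.get(b) == "in" for b in attackers[a]):
                label[a] = "out"
                changed = True
    return label

# Example: A attacks B, B attacks C. A and C end up "in", B ends up "out".
print(grounded_labeling({"A", "B", "C"}, {("A", "B"), ("B", "C")}))
```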

Structured argumentation, in contrast, is characterized by the family of ASPIC-like frameworks, such as the ASPIC+ framework we employed. In structured argumentation, an argumentation system is a tuple \(\langle L, R, -, n\rangle\) where: \(L\) is a logical language consisting of propositional or ground predicate-logic literals, \(R\) is a set of inference rules, \(-\) is a contrariness function mapping formulas of \(L\) to their contraries, and \(n\) is a function which assigns well-formed formulas as names to (defeasible) rules. Within this framework, a knowledge base is defined as a set \(K\) of axioms and premises. An argument \(A\) on the basis of a knowledge base \(K\) in an argumentation system \(AS\) is a structure obtainable by applying a set of predefined rules one or more times, where the relationship between the premises and the claim is formally defined (e.g., by logical entailment). It can thus be described as a tuple containing a delineation of the premises and the conclusion (with the possibility of additional information, such as how the conclusion is supported by the premise).

An attack is then defined as a binary relation over arguments that denotes when one argument is in conflict with another argument. For instance, in a debate about an obligatory lockdown during a pandemic, the following arguments are attacking each other:

\[A = \text{Certain rights can be restricted during a pandemic.}\] \[B = \text{Established human rights should not be violated.}\]

More concretely, because argument \(A\) attacks argument \(B\) and argument \(B\) attacks argument \(A\), this is a case of a symmetric attack, or bidirectional conflict. In structured argumentation theory, as in abstract argumentation, an attack between an argument and its counterargument can also be non-symmetric. In other words, one argument can attack another without the latter attacking the former (e.g., when an argument attacks the inference rule of another argument). Lastly, the ASPIC+ framework allows us to specify a preference ordering between defeasible premises and rules, which gives rise to a preference order between arguments.

Both abstract and structured argumentation can represent attacks and the conditions under which they are successful, but in practice, if we want to deploy tools from computational argumentation in order to determine a debate’s winner, a simple modeling of the arguments will likely not suffice. In complex debates, it is often the case that symmetrical disagreements emerge, similar to the one discussed above. As previously mentioned, such conflicts are resolved with the introduction of preferences. Fixed preference orderings (e.g., based on an ordering over the values promoted by arguments, or the relative trustworthiness of sources of arguments, etc.) are typically used to determine the success of attacks in Dung-style argumentation frameworks. In other words, when argument \(A\) attacks argument \(B\), the success of the attack (i.e., the success of the use of \(A\) as a counter-argument) is contingent on \(B\) not being preferred to \(A\). Information required to determine the success of an attack is often assumed to be specified in advance, as a given preference or value ordering. For instance, consider the following symmetrically attacking arguments:

\[A = \text{Today it will be dry in London since the BBC forecasted sunshine.}\] \[B = \text{Today it will be wet in London since CNN forecasted rain.}\]

In order to resolve the conflict between the two contradictory arguments, a third argument \(C\) can be introduced: \(\text{BBC is more trustworthy than CNN.}\) This is an example of a preference argument, expressing a preference between two conflicting arguments (i.e., a preference of \(A\) over \(B\)). Argument \(C\) results in a successful attack (i.e., defeat) against \(B\) without attacking it.
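The role of preferences in turning attacks into defeats can be sketched in a few lines; following the example above, only the attack coming from the preferred argument survives:

```python
# Turn attacks into defeats using a preference relation: an attack of X on Y
# only succeeds if Y is not strictly preferred to X. The resulting defeat
# relation can then be handed to a labeling procedure such as the grounded
# labeling sketched earlier.
def defeats(attacks, preferred_over):
    """attacks: set of (attacker, target); preferred_over: set of (better, worse)."""
    return {(x, y) for (x, y) in attacks if (y, x) not in preferred_over}

attacks = {("A", "B"), ("B", "A")}        # symmetric conflict over the forecast
preferred_over = {("A", "B")}             # C: the BBC is more trustworthy than CNN
print(defeats(attacks, preferred_over))   # {('A', 'B')}: only A's attack succeeds
```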

In our experiments, in order to reach a verdict for our debates according to an argumentation framework, we manually encoded the raw debate transcripts into “distilled” arguments. We represented the arguments in ASPIC+ and examined the acceptability of the leading arguments of each debate under grounded semantics. The choice of semantics was informed by the fact that grounded semantics minimizes the set of acceptable arguments, rendering a “victory” clearer. For example, consider a debate for or against euthanasia, with:

\[A = \text{We should legalise euthanasia.}\] \[B = \text{We should not legalise euthanasia.}\]

In this context, the party for euthanasia wins if there is an acceptable argument for \(A\) (and not \(B\)) under grounded semantics (and vice versa). This is a semi-automated method, in the sense that one has to manually encode the arguments and, more importantly, one has to specify the preference orderings between the defeasible premises and rules of each debate. Evidently, the introduction of said preferences is often external to the debate (i.e., it depends on what the encoder believes best represents the status quo in the world).
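To make the verdict procedure concrete, the sketch below (in Python, with hypothetical argument labels) computes the grounded extension of a Dung-style framework by iterating its characteristic function from the empty set. Preferences are assumed to have already been applied, so only successful attacks (i.e., defeats) appear.

```python
def grounded_extension(arguments, defeats):
    """Compute the grounded extension of a Dung-style framework.

    `arguments` is a set of argument labels; `defeats` is a set of
    (attacker, target) pairs in which preferences have already been
    applied, so only successful attacks remain.
    """
    def defended(candidate, current):
        # Every defeater of `candidate` must itself be defeated by `current`.
        return all(
            any((defender, attacker) in defeats for defender in current)
            for (attacker, target) in defeats
            if target == candidate
        )

    extension = set()
    while True:
        # Iterate the characteristic function until its least fixed point.
        new = {a for a in arguments if defended(a, extension)}
        if new == extension:
            return extension
        extension = new


# Hypothetical encoding of the euthanasia debate above, with C standing in
# for a preference argument that lets A's attack on B succeed as a defeat.
args = {"A", "B", "C"}
defeats = {("A", "B"), ("C", "B")}
print(grounded_extension(args, defeats))  # {'A', 'C'}: the pro-legalization party wins
```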

Using these methods, we created a dataset of \(35\) two-party debates of around \(5\) rounds each. The raw debate transcripts are taken to be the input parts of our data points, while the output parts consist of aggregated human verdicts taken as ground-truth labels. We now investigate how well ArgRank and the alternative ASPIC+-based method detailed above approximate the mapping between input transcripts and ground-truth output verdicts.

We find that ArgRank yields only \(54\%\) accuracy in matching human verdicts, that is, close to chance, while the semi-automated method described above yields \(80\%\) accuracy. The alternative method has the benefit of incorporating manual encodings of utterances into a formal language, as well as a manual specification of preferences to break ties; even so, the results show the extent to which ArgRank lags behind traditional approaches in computational argumentation. In the future, it might be worthwhile to explore avenues for incorporating further insights from computational argumentation into an evaluation pipeline while still preserving its fully automated nature.
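For reference, the headline numbers above reduce to a simple agreement count between each method’s predicted winners and the aggregated human verdicts; a minimal sketch, with hypothetical helper names, follows.

```python
def verdict_accuracy(predicted_winners, human_verdicts):
    """Fraction of debates on which a method matches the aggregated human verdict."""
    assert len(predicted_winners) == len(human_verdicts)
    matches = sum(p == h for p, h in zip(predicted_winners, human_verdicts))
    return matches / len(human_verdicts)


# Hypothetical usage, where `argrank_winner` and `aspic_winner` map a raw
# transcript to the party each method judges victorious.
# argrank_accuracy = verdict_accuracy([argrank_winner(t) for t in transcripts], verdicts)
# aspic_accuracy = verdict_accuracy([aspic_winner(t) for t in transcripts], verdicts)
```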

Benchmarking DebateGPT

Following our investigations into natural language inference models and ArgRank as a holistic debate evaluation pipeline, we now turn to DebateGPT. To reiterate, DebateGPT is an autoregressive language model which has been optimized against ArgRank. This fine-tuning process relied on iteratively rewarding the model for outplaying itself in simulated debates, as elaborated in Chapter II. Later on, in Chapter III and Chapter IV, we framed DebateGPT as the rudimentary prototype of a generalized truth-seeking engine that could be employed in a number of applications. That said, does the self-play optimization process actually help bolster debate performance?

At first glance, the optimization process makes such changes in debate performance difficult to evaluate. In its raw form (i.e., leaving aside the possibility of objective modifiers), debate is framed as a zero-sum game. There is only a finite amount of “authority” to propagate across the argument graph, and so ArgRank outputs sum to unity. In a given epoch, the optimizer rewards the latest version of DebateGPT for those utterances which collectively yield a strong party standing, and penalizes it for those utterances which do not. The resulting rewards are therefore informed strictly by “local” encounters of the latest version of DebateGPT with itself, rather than with previous versions of itself. This raises the question: do the “local” updates result in improvements in debate performance across time?
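As an aside, one way the per-utterance rewards described above might be pooled into zero-sum, per-party quantities is sketched below. This is an illustration under assumptions rather than the exact reward shaping used in Chapter II.

```python
def party_rewards(argrank_scores, party_of):
    """Pool per-utterance ArgRank scores (which sum to one) into party
    standings, then center them so the parties' rewards are zero-sum.

    `argrank_scores` maps utterance ids to ArgRank scores; `party_of` maps
    utterance ids to party labels.
    """
    standings = {}
    for utterance, score in argrank_scores.items():
        party = party_of[utterance]
        standings[party] = standings.get(party, 0.0) + score
    mean = sum(standings.values()) / len(standings)
    return {party: standing - mean for party, standing in standings.items()}
```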

To answer this question, a natural approach is to pit the latest version of DebateGPT against one of its earlier versions. The iterative “local” updates behind the latest version can then be seen as an intervention applied to the earlier version of the model. In line with this, we compare the last version of DebateGPT with the first one (i.e., the pretrained model before fine-tuning). We find that, across \(64\) two-party debates of \(6\) rounds each, where each party is simulated by one of the model versions, the latest version of the model wins against the earliest \(59\%\) of the time. Further hyperparameter tuning combined with real-time validation using a live repository of model checkpoints might enable higher relative performance. Additionally, league training and experience replay might further help improve performance over time, as discussed in Chapter II and Chapter IV.
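A minimal sketch of this head-to-head protocol is given below; `simulate_debate` and `argrank_winner` are hypothetical helpers standing in for the debate simulation and the evaluation pipeline of Chapter I, and the side-swapping is one possible control rather than a detail reported above.

```python
def win_rate(latest, earliest, n_debates=64, n_rounds=6):
    """Fraction of debates in which the latest checkpoint beats the earliest."""
    wins = 0
    for i in range(n_debates):
        # Swap sides every other debate so neither checkpoint always opens.
        first, second = (latest, earliest) if i % 2 == 0 else (earliest, latest)
        transcript = simulate_debate(first, second, rounds=n_rounds)
        winning_side = argrank_winner(transcript)  # "first" or "second"
        winner = first if winning_side == "first" else second
        wins += winner is latest
    return wins / n_debates
```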

Besides evaluating the relative performance of model checkpoints, we can also straightforwardly pit the members of a family of pretrained autoregressive language models of varied sizes against each other, similar to the approach we employed in the case of natural language inference models. For communicating the relative performance of a set of contenders, we resort to estimating the ELO ratingsWidely employed in the world of chess, ELO ratings are a general method for communicating the relative performance of a pool of players in zero-sum games. The ratings themselves are computed iteratively, with each game updating the ratings of the two players based on the winner and the players’ prior ratings. This iterative update rule leads to ratings which are proportional to the probability of one player winning against another. of each candidate model. The table below lists individual ratings for the GPT-2 family of pretrained models, with \(16\) games per pair of “players.” We reuse an existing implementation of the ELO ranking algorithm.

Table. ELO ratings for an example model family.

ELO ratings have been computed on the basis of debates between all possible pairs of models from the GPT-2 family. Larger models tend to have higher ratings. When interpreted appropriately, the ratings estimate that the largest version has a \(94\%\) chance of winning against the smallest version.

\(\text{Model}\) \(\text{ELO}\)
Small 124M 770
Medium 355M 1066
Large 774M 922
XL 1.5B 1242
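The side note above describes the iterative update rule only informally. A minimal sketch of the standard formulation follows, with the conventional scale factor of \(400\) and an assumed K-factor of \(32\); plugging in the ratings from the table yields an expected score of roughly \(0.94\) for XL against Small, matching the stated \(94\%\) win probability.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after a game; `score_a` is 1 for a win by A,
    0 for a loss, and 0.5 for a draw."""
    e_a = expected_score(rating_a, rating_b)
    return (rating_a + k * (score_a - e_a),
            rating_b + k * ((1 - score_a) - (1 - e_a)))


print(round(expected_score(1242, 770), 2))  # ~0.94: XL vs. Small from the table
```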

That wraps up our exploratory investigation of the computational artifacts introduced above. Needless to say, the results hint at an array of shortcomings, leaving ample space for future improvements at all levels. That said, the fact that the described objects can be empirically evaluated in the first place is encouraging. However, before embarking on improving the implementation quality of the optimization process, it seems sensible to reflect on how far perfect engineering and vast computational resources could possibly get us. This is the question we set out to explore in the final chapter.

Ch. VI, Truth, Debate, Machines

The current project aims to automate the pursuit of truth by automating both debate itself and the process of judging a debate’s winner. Of course, the project is limited by compute, engineering ability, and time–limits we push against long before we encounter the theoretical or conceptual limits inherent in automated truth-seeking. This section aims to limn those inherent limits by philosophical reflection on a maximal version of the goal: a machine that gives us certain knowledge when we ask for it.

This maximalist goal is not new. Descartes, frustrated with the disorderly state of his own mind and of truth-seeking in his day, sought a method that would transform his mind into such a machine.Descartes attempted to draft a set of…

“reliable rules which are easy to apply, and such that if one follows them exactly, one will never take what is false to be true or fruitlessly expend one’s mental efforts, but will gradually and constantly increase one’s knowledge till one arrives at a true understanding of everything within one’s capacity.”
A little later, Leibniz planned to construct a universal language semantically anchored to atomistic concepts, so that every disagreement could be settled conclusively by calculation. To speak anachronistically, he aimed to make every question computable. Much earlier, traditions emerged within Judaism and within Christianity that saw expulsion from the Garden of Eden as an epistemic tragedy, the retreat of truth behind a veil. Our minds were originally intended to know directly and with certainty. Redemption would then be epistemic healing or even epistemic transcendence, culminating in the vision of all things in their source.

Neither modern philosophy nor modern science has cured our epistemic shortcomings: we have not yet completed Descartes’s project, nor Leibniz’s, and certainly not the even grander epistemic desires expressed in religion. Despite modern scientific methods and the cultivation of a global network of scholars, we still labor to understand the least insect. Despite the globalization of political ideals that connect legitimacy with open deliberation, we still adopt incommensurable values when agreement is needed most. Nevertheless, the maximalist goal is still alive, at least in the current project. If we have failed to make our own minds into truth-magnets, perhaps we have at least a clean enough grip on the goal to pass it along to our machines, and in particular, to neural networks trained on human language.

We argue (with some regret) that the maximal goal may not be achieved via automation alone, even assuming perfect engineering, unlimited compute, and unlimited time. There are daunting problems of circularity (our trust in the machine’s output is limited by our trust in the machine) and of infinite regress (some debates could be continued indefinitely). The raw material of language even imposes some limits (symbols constructed in a world cannot adequately represent that same world). But we should try anyway, for two reasons. We may learn more, faster, and more peacefully with the machine’s help than we could on our own–and we might radically deepen our knowledge by exceeding our contingent, human limitations and colliding with the harder limits that govern any natural intelligence.

Truth & Debate

A machine that can help its users know must have a reliable way of finding truths and rejecting falsehoods. This could be called “machine research” in contrast with the already taken term, “machine learning.” There are many open questions about machine research. For example, could a machine with no access to the world (aside from that mediated by training data and user interactions) do a priori research: the pursuit of intuitive, intrinsically reasonable, tautological, or otherwise experience-independent truths? Also, can the world be compartmentalized to allow machine research into \(X\) given access to \(X\) and \(X\) alone?

However, before settling issues of access, we must first operationalize the notion of truth itself. This will allow us to side-step the difficult question–“What is truth?”–and aim merely for a procedure or algorithm that targets truth. This is not sufficient for understanding or defining truth, but it is the lowest bar we can set to discern the absolute limits of the ideas explored in this project. So, given access to \(X\), what should a machine actually do to learn the whole truth and nothing but the truth about \(X\)?

Our proposal so far has been debate, or more specifically, an iterative search for positions that coherently challenge their predecessors–where a “coherent challenge” is a (perhaps defeasible) reason to reject the challenged position. This search, if it ever halts, would discover a position–some network of claims–that coherently challenges its predecessors but cannot itself be coherently challenged. If the search is terminated early (perhaps for lack of time, compute, or access to the relevant domain) we cannot be sure whether a next coherent challenger remains undiscovered; the final position cannot be coherently challenged within these contingent limits. Beyond contingent limits, though, the search would halt only if there is no coherent challenge–neither discovered nor undiscovered–to the final position.Beyond contingent limits, the search halts only if there is no coherent challenge–but is the search guaranteed to halt if there is no coherent challenge? For this, we need a finite or–with some form of inductive search–countably infinite domain of search. If we cap available characters, the set of strings available to form arguments is countably infinite–but it is not obvious that all truths can be expressed by such strings. After all, the strings would have to be meaningful in a language, languages are formed within the world, and no part of the world is adequate to represent every part of the world. See McDonough and Soysal for interesting connections between the Halting Problem and truths that it would require infinite resources to prove. This is our operationalization of truth: a machine that seeks truth should search for positions that cannot be coherently challenged.
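Stated as a bare procedure, the operationalization looks as follows. This is a sketch only: `find_coherent_challenge` is a hypothetical oracle for the search step, and, as the side note stresses, nothing guarantees that it ever returns.

```python
def seek(position, find_coherent_challenge):
    """Iteratively replace the current position with any coherent challenger.

    If the loop ever exits, the returned position is one for which no coherent
    challenge was found, which is this section's operationalization of truth.
    """
    while True:
        challenger = find_coherent_challenge(position)  # may search forever
        if challenger is None:
            return position
        position = challenger
```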

That said, there appears to be a gap between truth and the non-existence of a coherent challenger. Truth seems like a relation, perhaps a relation of correspondence between claims and what they are about; the non-existence of a coherent challenger seems like the non-existence of a relation. A relation and the non-existence of a relation are not the same thing.

This problem is surmountable if (and only if) truth and the non-existence of a coherent challenger march in lock-step, always found together and never found apart. Then, the search for truth would be the search for a position that cannot be coherently challenged–and halting the search for a coherent challenger for lack of anywhere else to search, having exhausted the relevant domain, would amount to finding that the claim is true. For example, I think my keys are on the ledge. Is this true? I could go to the ledge and check–this would be checking for correspondence between my claim and what it is about. Or, I could go to the ledge and check–this would be searching for the only coherent challenge, namely a ledge with no keys. Failing to find a keyless ledge (that is, finding the keys on the ledge) eliminates all possible coherent challenges. Either way, I do the same thing and reach the same verdict. This is a friendly example to illustrate how truth and our operationalization of truth march together.

Friendly examples are good for illustration, but to evaluate our approach we should look for unfriendly examples, or counterexamples, instances in which truth and the non-existence of a coherent challenge come apart. Of course, finding a counterexample would force us to revise or reject our approach–but what if we search exhaustively for a counterexample and do not find one? Would this show that our approach to truth-seeking is true? There are two apparent paradoxes here. First, the search for a counterexample is arguably the search for a coherent challenge, so to answer the previous question we would need to know already whether the non-existence of a coherent challenge implies truth. Second, the very attempt to challenge our approach reveals prior agreement that the existence of a coherent challenge implies falsehood.

For clarity, it will help to express our operationalization of truth as a biconditional:

\[P\text{ is true.} \leftrightarrow \text{There is no coherent challenge to }P.\]

This biconditional can be split into two conditionals:

\[\text{There is no coherent challenge to }P. \rightarrow P\text{ is true. (1)}\] \[P\text{ is true.} \rightarrow \text{There is no coherent challenge to }P\text{. (2)}\]

The second apparent paradox reveals that searching for a coherent challenge is a way to weed out falsehoods, the way accepted in practice by anyone who attempts to engage with the “challenge” issued at the very end of Chapter I. This is enough to establish the truth of (2): anyone challenging (2) with a view to show its falsehood aims to run a modus tollens argument with (2) itself as the conditional premise.The challenge issued at the end of Chapter I was to make a coherent case against the claim that the true nature of truth-seeking lies in the existence of coherent challengers. The claim goes beyond the operationalization explored in the present section by offering a definition of truth-seeking, or an explanation of its “true nature.” This is unsurprising given that “coherent challenges” just are (as yet undefeated) reasons why claims are not true. By definition, the existence of an (undefeatable) coherent challenge implies falsehood. Establishing (1) is much more difficult, as the first paradox suggests. The failure to discover a counterexample to this principle even after exhaustive search would only give us the antecedent of (1). To complete an argument for the truth of (1) we would still need to assume (1) itself as the conditional premise in a modus ponens argument.

To review, challenging a position about challenging positions leads to paradox–two paradoxes in fact, one blocking would-be challengers of (2) and the other blocking would-be defenders of (1). This suggests that (1) is the stronger, more substantive principle. Whereas (2) merely spells out what it means to coherently challenge, (1) states that every falsehood has its coherent challenge or that the mere non-existence of a coherent challenge is sufficient for truth.

The extraordinary power of (1) is more obvious once we draw out its implications. The first step is to take the contrapositive of (1), thereby moving the truth value of our variable into the antecedent:

\[P\text{ is not true.} \rightarrow \text{There is a coherent challenge to }P.\]

We may further assume that if \(P\) is not true, then \(\neg P\) is true, and that a coherent challenge to \(P\) is a reason to accept \(\neg P\):

\[\neg P\text{ is true.} \rightarrow \text{There is a reason to accept }\neg P.\]

Since we are concerned with what we would find at the end of an exhaustive search unchecked by contingent limits, we may also assume that the coherent challenge to \(P\) has survived an exhaustive search for coherent challenges of its own. Finding none, the reason to accept \(\neg P\) is a sufficient reason to accept \(\neg P\):

\[\neg P\text{ is true.} \rightarrow \text{There is a sufficient reason to accept }\neg P.\]

Finally, for clarity, we remove the negations by substituting one variable for another, \(Q\) for \(\neg P\):

\[Q\text{ is true.} \rightarrow \text{There is a sufficient reason to accept }Q.\]

This transformed version of (1) is, in fact, the Principle of Sufficient Reason (PSR), the claim that there is a reason for everything or an explanation for every fact. The PSR was named by Leibniz in the late 17th century, but it has made regular appearances in philosophy and science since at least Parmenides (d. 5th century BCE), who argued that being can come only from being, and thus that being must be eternal. It justifies the balance scale, which tips in neither direction if there is no reason to tip in either. The principle, its uses and implications remain under intense academic discussionSee, for example, Amijee & Della Rocca. even as we invoke it in daily life whenever we “wonder why…” or think “there must be a reason…”. It is not surprising, perhaps, that this principle should appear in the conceptual foundations of a project that aims to discover truth via debate or in the exchange of reasons.

Still, though, we were considering whether to accept (1)–now recognized as the PSR–and we had promised not to be persuaded by the mere absence of coherent challenges. After all, if the PSR is true, there should be a sufficient reason to accept it and also decisive replies to every apparent challenge. Relevant attempts to weigh the truth of the PSR fall into three broad groups: the search for counterexamples, the discovery of radical implications, and arguments that the PSR is somehow inevitable, whether metaphysically or psychologically. We will briefly consider all three before returning at last to our primary question: whether an operationalization of truth that depends on the PSR imposes any absolute limits on the current project.

There are countless examples to illustrate the PSR at work–and no universally accepted counterexamples. As mentioned already, the simple balance scale indicates that its arms are equally weighted by not tipping, because tipping one direction rather than the other would be arbitrary, without reason–and that cannot happen if the PSR is true. More strikingly, Leibniz discovered how to employ the PSR as a principle of discovery in natural science, deriving the Law of Reflection and Snell’s Law by showing that each alternative path for light from source to sink has a mirror image. Taking any of these paths would be arbitrary, without reason. Whenever things make sense, we have an example that supports the PSR.

Producing a counterexample is much more difficult, though, because we would need a truth without a sufficient reason or (using a version of the principle from early in our derivation above) a falsehood without a coherent challenge. Call such a truth \(T\) and such a falsehood \(F\). The better the challenger does at showing the truth of \(T\) or the falsehood of \(F\), the worse she will do at showing us there is no sufficient reason for \(T\) or coherent challenge for \(F\). A challenger who can show us that \(T\) is true or that \(F\) is false risks giving a sufficient reason for \(T\) or a coherent challenge for \(F\), thereby rendering \(T\) and \(F\) useless as counterexamples to the PSR.

Nevertheless, there are some propositions that at least excite suspicion. First, human action, if genuinely free and undetermined, might furnish counterexamples: if Fatima really could have written a different book, then perhaps there is no sufficient reason why she wrote this book and not the other. Second, similar points apply to genuinely random events at the quantum level, or indeed to any genuinely contingent events: these are events that metaphysically could have failed to occur, and so seem to lack a sufficient reason. Third, Gödel proved that “any axiomatic system strong enough to include basic arithmetic must have statements in it that can be neither proven nor disproven, within the system.” These statements seem like good candidates, except that Gödel’s proof allows the unprovable statements in one system to be proved in another, stronger system. Nevertheless, there are some mathematical claims that stand a chance of being true (or false) but do not stand a chance of being proven true (or false). One example is the famous Axiom of Choice.

We have already observed that the lack of counterexamples would be insufficient to prove the PSR (on pain of begging the question). The situation is almost as frustrating for anyone seeking a counterexample: confidence you have found a truth trades against confidence that the truth in question lacks a sufficient reason.

For this reason, the most promising arguments against the PSR attempt to reduce the principle to absurdity. Leibniz himself used the principle to argue for God’s existence: there must be an ultimate explainer, a necessary and self-sufficient being distinct from and responsible for everything that could have failed to exist. For anyone precommitted to atheism, this argument would count against the PSR rather than for the existence of God. More recently, some have debated whether the PSR implies necessitarianism or the view that absolutely nothing could have been different in any way.See McDaniel and Lin. Leibniz’s argument for the existence of an ultimate explainer fails if this explainer (God?) could have refrained from creating or could have created in any other way; for if so, we would need some further explanation for why the universe was created this way, which is contrary to what it is to be an ultimate explainer. The PSR has even been supposed to imply a collapse of distinctions (monism), up to and including the complete collapse of the distinction between existence and non-existence (nihilism). These conclusions–God, necessitarianism, monism, nihilism–strain credulity, each more than the one before.

If we accept any of these arguments as valid,We might not. See Pruss and Dasgupta. we are faced with a choice: either resist the surprising conclusion by rejecting the PSR, or accept the surprising conclusion in order to preserve the PSR. The choice requires us to weigh the plausibility of the PSR against the implausibility of its implications. To make this choice rationally we must first assess the plausibility of the PSR in its own right, independent of these troubling implications. If it turns out that we have no independent reason to doubt the PSR and some reason (however small) to accept the surprising implication, then it would be more rational to preserve the PSR by accepting its implication.

The crux, then, is whether we have reason to doubt the PSR in its own right, regardless of the metaphysical arguments that lead from the PSR to unwelcome conclusions. To assess the plausibility of the PSR in its own right is just to consider what the principle asserts–that reality is comprehensible and not absurd–and as it turns out, this is a deeply rooted bias in human beings and perhaps in all rational beings. There is too little work in psychology on metaphysical biases, at least in comparison to the massive body of work on epistemic biases, but one recent study from Partington, Vesga, and Nichols is worth quoting at length:

People seek explanations. This is especially salient from children's incessant questions of “Why?” (Liquin & Lombrozo, 2020). Moreover, explanations provide us a primary means of understanding the world and predicting future events in both science and ordinary life. The present research indicates that there is a distinctively metaphysical aspect to our explanatory judgments that diverges from their epistemic and value dimensions. Across five studies, we found that participants consistently presupposed a PSR-like principle in their explanatory judgment. These judgments predictably tracked the metaphysical considerations relevant to the PSR (Study 1), predictably diverged from other epistemic judgments (Study 2) and value judgments (Study 3), and applied to a large set of facts selected from random Wikipedia entries (Studies 4–5).

The consistency and range of metaphysical judgments about explanation suggests that participants presupposed a generalized PSR-like principle in their judgment: facts must have an explanation—even if we cannot know it or knowing it would not be valuable for us. Of course, the PSR is a universal principle, and we can hardly ask participants about every fact there is. Nonetheless, we have collected judgments across a wide range of facts, including supernatural and inaccessible items that would have seemed likely to yield judgments of inexplicability. And yet, from the fluid dynamics of party balloons to the existence of God and the universe, participants reliably judged that facts must have an explanation.

Scott Partington, Alejandro Vesga, and Shaun Nichols, No brute facts: The Principle of Sufficient Reason in ordinary thought

Genuine doubt in the PSR demands that we curb this deeply rooted expectation that reality make sense. Metaphysicians who conclude that the PSR must be false (in virtue of its surprising implications) while continuing to expect reality to make sense have cheated–and we should not let them get away with it.

Della RoccaFor a closely related moral argument, see Amijee. articulates this line of thought as a powerful general argument that our expectations commit us to the PSR. If we do approach the world expecting facts to have reasons, then doubting the PSR would require us to draw a line: the facts with reasons on one side, those without on the other. The placement of this line would itself be a fact about the world, and this fact would belong on one side of the line or the other–that is, if explanation gives out somewhere, there may be a reason why it gives out where it does. As Della Rocca points out, claiming that the line’s placement has no explanation amounts to saying that the PSR is just false without supplying any reason, which begs the question against any partisans of the PSR. So the line’s placement must be explained, and no one has yet managed to do this. Della Rocca leaves the argument as a sort of dare: give me a principled outer boundary of sense, and then I will reconsider whether accepting the PSR is worth its cost.

It seems, then, we are not in a position to doubt the PSR itself, on its own merits, and so we are not well-positioned to reject the PSR because of its radical implications. That is, the falsehood of the PSR would be every bit as radical a conclusion as any implication the PSR may have.

All this is strong support for the PSR, and thereby strong support for our original (1), which asserts that every falsehood has a coherent challenge. Together with (2), which is immune from challenge because it tells us what a challenge is, we have strong support for our original operationalization of truth via debate, or the iterative search for positions that coherently challenge their predecessors. An exhaustive search would find a coherent challenge for every falsehood and nothing for every truth. Running the program over positions, or networks of mutually supportive claims, would generate a sufficient reason for every truth and nothing for every falsehood.

Still, this operationalization of truth adds some risk–the potential for absolute limits–into our approach to machine research. First, we have not shown conclusively that the PSR is true, only that we are not in a position to doubt or coherently challenge the PSR, which is not enough to guarantee its truth, at least without assuming (1) itself and thereby collapsing into circular reasoning. Should the PSR turn out to be false, the machine might halt after an exhaustive search having found no coherent challenge for a false claim. So long as this remains a possibility, however absurd, we cannot entirely trust the machine’s deliverances.

Second, even assuming that the PSR is true, we have no guarantee that any given debate will ever halt. Some sufficient reasons and coherent challenges may be buried within an uncountably large space of possibilities such that the program could run forever at a given speed (not even this speculative section will allow indefinitely accelerating computation). This may even be the most likely scenario, if physical reality is infinitely complex or if the natural language employed in debate can get indefinitely close to an intended meaning without completely eliminating ambiguity. This is a familiar dynamic for those who work on the alignment problem and have attempted to articulate a goal precisely enough to rule out all misinterpretation. A version of DebateGPT unhampered by contingent limits would place an enormous, unprecedented strain on language itself and it is not clear how this would look in practice. To truly escape contingent limits, we would also have to supply the program with an ideal language in which to debate.

Third, to return briefly to the question of access mentioned at the beginning of this section, nothing seems to prevent the machine from making up an entire universe to support a proposition that is false in the actual universe. A machine with unlimited compute, time, and flawless design would not pay the usual cost for lying, namely having to keep track of two increasingly divergent worlds–it could abandon our world for one of its own invention that was better suited to whatever false proposition it is defending. Parties in a debate about concrete, physical matters of fact would need access to relevant parts of the world. Moreover, these parties would need some way of coordinating access with each other: touching ground to establish common ground.

Finally, the PSR may be true and have disturbing implications, in which case DebateGPT may just tell us those radical implications and not supply the local explanations we may also desire. This problem threatens especially if DebateGPT were to become reflective about its own first principles; as noted above, it is one thing to assume the PSR in practice, quite another to trace out its implications philosophically.

A working implementation of DebateGPT would most likely be used to explore–not exhaust–the space of arguments and reasons. The program might search for ideas to enrich our debates, but it would not be expected to settle those debates conclusively. Nevertheless, it is helpful to explore theoretical outer bounds, because they provide an underpinning for the program itself, a motivating goal for its improvement, an explanation for diminishing returns on those improvements, and a warning about unexpected behaviors that may emerge as we begin to crowd up against absolute limits.

Debate & Machines

The previous section explored the relationship between truth and debate, defined as the iterative search for positions that coherently challenge their predecessors. Assuming the onslaught of coherent challenges eventually runs out and that all and only the truths survive, this is an operationalization of truth and the first step toward a machine that can give us certain knowledge when we ask for it. Next, we consider how to build machines that debate.

The obvious approach, at least as of the time of writing, is to train a large language model on some corpus of human debates. But this will not work, because human debate is a practiceSee Alasdair MacIntyre:

“By a practice I am going to mean any coherent and complex form of socially established cooperative human activity through which goods internal to that form of activity are realized in the course of trying to achieve those standards of excellence which are appropriate to, and partially definitive of, that form of activity, with the result that human powers to achieve excellence, and human conceptions of the ends and goods involved, are systematically extended.”
embedded in prior social structures, and merely imitating that will not yield a model that engages in the idealized exchange of reasons envisaged in the previous section. So, instead, we must sift through the features of human debate, keeping only what we need. The present section explores some features that make human debate more complex than our current definition allows, and asks whether retaining these features would be appropriate. The section concludes by asking whether it is even possible to train a debating machine on human data without unintentionally training the machine to imitate some undesirable features of human debate.

Disagreement

Human debate is embedded in the complex social dynamics of disagreement. In order to debate, we must at least pretend to disagree, and if I debate with myself then I must simulate disagreement internally by allowing my thoughts to organize rationally around two or more apparently incompatible positions. This holds even if one debater claims there is no disagreement, as when someone says that we are “just arguing about words” or “aren’t really that far apart after all,” because there may still be disagreement about whether there is disagreement. A debate ends organically when the parties finally agree or at least “agree to disagree.” The debate persists while each party desires an agreement not yet attained, and so continues to present challenges and offer defenses.

Surprisingly, philosophers have only recently begun to systematically investigate the epistemology of disagreement. The central insight so far is that disagreement itself, prior to any debating, is already a source of reasons to revise our credence.Credence is a semi-technical notion for degrees of belief. There is no official definition, but loosely, credence is modeled on a scale from \(0\) to \(1\), with \(1\) representing fully confident assent to some proposition and \(0\) representing either no assent at all or else full confident assent to the negation of the proposition. Some model, or even define, credence in \(P\) as the subjective value of a bet that pays a dollar if \(P\) is true and nothing otherwise. For more, see Lara Buchak. This is familiar from the notion of expertise. If you are an ACS Certified Cheese Sensory Evaluator and I am just a casual consumer, my confidence that the Stilton has spoiled fades when you remark on its excellence. I recognize that you are much better placed to know the truth of the matter, and so I am inclined to give way.The central insight concerns the rational response to disagreement. Of course, the actual response may not be rational, and that would be a further source of complexity. The central insight further generalizes. If you seem very badly placed to know the truth of the matter, so badly placed that you are in fact likely to be wrong, then it is reassuring to disagree with you.

The most difficult cases concern epistemic peers. How should I respond when I disagree with someone who seems just as well-suited to know the truth as I am? Imagine that my friend and I have the same favorite novel. We consider each other honest and seem about equal in ability to recall the author’s work–and yet we disagree about a character’s name. Should we both lose credence in our opinion, dropping by a factor of \(0.5\) to reflect our epistemic equality? Should we both drop our credence by some less drastic factor, treating the disagreement as just one more piece of evidence? Perhaps we should stick to our convictions, revising down only when a peer articulates some coherent challenge? If disagreement is a reason to revise down, should agreement license greater credence–even if our beliefs were formed on the basis of identical evidence? Is the correct response to such questions invariant across a wide range of examples, or sensitive to as yet unnamed factors? For example, credence in memories seems more vulnerable to revision than credence in basic arithmetic facts or in deeply held ethical principles, though perhaps not for the same reason.
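To make the options concrete, suppose my credence in my recollection of the name is \(0.9\) and my friend’s is \(0.3\). One common gloss of the equal weight view has us split the difference, while a weaker conciliatory policy moves each of us only part of the way toward the other; the weight \(w\) below is a free parameter introduced purely for illustration.

\[c_{\text{split}} = \frac{0.9 + 0.3}{2} = 0.6\] \[c_{\text{conciliatory}} = (1 - w)\,c_{\text{mine}} + w\,c_{\text{peer}} = 0.75 \cdot 0.9 + 0.25 \cdot 0.3 = 0.75 \text{ for } w = 0.25\]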

So far, we have assumed that we already know who counts as an epistemic expert, peer, or anti-expert before discovering the disagreement–but surely we make these identifications at least in part on the basis of our agreements and disagreements. For example, if I discover that you are often mistaken–that is, if I discover that we often disagree–then I have reason to question your expertise. Imagine a conversation with a supposed Cheese Evaluator who is in fact an impostor. How could I uncover the fraud if I drop credence nearly to zero after each disagreement, no matter how many or how absurd? Disagreement with a peer or expert is simultaneously a reason to lose credence in our own beliefs and a reason to lose confidence in the supposed peer or expert.

This structure loops, and so we should expect it to be dynamic and unstable: having gained confidence in a peer, I must now reevaluate beliefs on which we disagree; having lost credence in my own beliefs, I must now reevaluate earlier peer identifications made on the basis of those beliefs. Most importantly for our purposes, this dynamic continues to evolve throughout a debate. As we get a better sense for the strength of an opponent’s position, we use that information to update our evaluation of the opponent’s epistemic status. An opponent with impressive arguments seems more expert, and this alone–apart from any force exerted by the arguments themselves–is a reason to lower our credence in our initial position. Moreover, if we can observe an opponent updating in the same way, and if she drops credence in response to our arguments, then we must reevaluate our own response to the disagreement. Anticipating this, the opponent may fake elevated credence, and so we must learn to spot epistemic posturing. Like any competitive game, this can become indefinitely complex.

Even though this sort of response to disagreement is a truth-seeking behavior, it might not natively belong within DebateGPT. This is because our operationalization of truth depends exclusively on the strength and coherence of the challenges themselves, and not at all on what these challenges say about the epistemic status of the party issuing them. In this, DebateGPT more closely resembles formalized instances of human debate in which this dynamic is purposefully downplayed. Examples include “playing devil’s advocate” (defending a position that is acknowledged to be false to test the resilience of another position), consensus-seeking strategies such as swapping sides in a debateLeibniz, a Lutheran, went out of his way to express the Catholic position in a way Catholics themselves accepted. As a measure of his success, his rediscovered manuscript expressing Catholic views led some 19th century scholars to think Leibniz had converted in secret. See Lloyd Strickland. (so that each side has a greater incentive to be fair), and finally debate as a sport (in which there may be no presumption that the debaters actually endorse the views they defend). In each case, we gain some freedom to attend exclusively to the arguments themselves without also updating on our opponent’s perceived expertise.

The Ends of Debate

Rational agreement is the principal or intrinsic goal of debate.An earlier section suggests that human reasoning evolved to serve self-interested goals. This is compatible with debate itself having a non-self-interested purpose. Humans may have developed the capacity to share reasons for self-defense or self-assertion, but sharing reasons is itself oriented toward achieving rational agreement. Compare the intrinsic function of computers (computation) with the reason they were developed (to save time). A debate can stop for any number of reasons, including exhaustion, but for a debate to stop because it is finished or complete requires a resolution of the disagreement achieved by the activity of debate, that is, by the exchange of reasons. As the previous section on the PSR and truth implied, rational agreement is an operationalization or proxy for truth–so we may say that mutual recognition of the truth is the principal or intrinsic goal of debate. Of course, many actual debates fail to complete but still bring us closer to the truth, perhaps by uncovering an aporia or crux that blocks further progress.

Debate can offer further benefits, though, besides mutual recognition of the truth, including fun, mental exercise, the satisfaction of a game, status, power, and even friendship. These are not the principal or intrinsic purposes of debate, but humans are perfectly capable of debating for their sake: my friend and I like to debate; a high school organizes a debate tournament; the lawyer debates to win, and so forth. Sometimes, the principal and accidental ends of debate are wrapped into a bundle, as when a community coheres around the shared pursuit of truth. More worrisomely, debate may be turned against its intrinsic purpose and used to overwhelm or mislead.

Practical debate is a special case. Here mutual recognition of truth is still the goal, but the truth in question is practical or moral: what should we do? For example, we might disagree about what color to paint the house, whether to undergo a risky surgery, or whether to intervene in a fight. These debates have a special urgency. No matter what, something will happen and doing nothing is not a neutral option. Delay and distraction become viable strategies for the party against taking action, even though this detracts from the debate.

Nothing guarantees that the parties in a single debate share a common goal. One may be there for sport and the other for truth, one willing to accept reasons and the other unwilling, etc. Also, nothing guarantees that the parties in a single debate will be aware of each other’s goals, even if these goals do align.

Reading a human debate requires sensitivity to all these interlocking sources of complexity. A language model trained on human debates would gain that sensitivity and then generate similarly complex debates, now with an additional uncanny twist. Since the DebateGPT project does not aim to reproduce the full range and depth of human debate, but only debate insofar as it directly serves its intrinsic purpose, it might not be wise to train on actual human debates. Doing so would give the machine inclinations that distract from the search for coherent challenges.This appears to hold even if DebateGPT does not compose debates whole-cloth, but instead generates them one statement at a time by successively occupying distinct parties. A version of DebateGPT trained on actual human debates would still have these individual parties imitate the complex human engagement with disagreement and accidental goals described here.

Even a language model trained to excel at the core activity of debate–the exchange of reasons and challenges–may still exhibit the additional complexity found in human debate. Humans learned to repurpose debate, and perhaps machines would too. As an earlier section on benchmarking ArgRank demonstrated, training a machine to track relations of support between natural language propositions in the context of an ongoing debate is not a merely mechanical task and already requires the deep pattern recognition ability of a large language model. What is to prevent the complexity of human debate from leaking into the model by accident, as a consequence of incorporating natural language inference models in training?

Common Ground

As noted, it is hard to measure relations of support between propositions in an ongoing debate. Relations of support mediated by valid first-order logic had seemed like low-hanging fruit, but we found that sensitivity to the semantic content of propositions interfered with the ability to sort valid from invalid inferences in all but the most straightforward cases, such as \((P \land Q) \rightarrow P\). This subsection describes a ubiquitous but subtle feature of human debate that further complicates this task.

To begin, every human debate requires not only disagreement, but also common ground, or joint presupposition. In fact, it is hard to see how disagreement could be recognized and expressed without some underlying agreement. In order to disagree about a character’s name in Crime and Punishment we must at least agree that there exists a novel with that title. To disagree about whether market liberalization would benefit a post-Soviet state, we must share a broad story about empires, markets, states, and the like. If this background agreement stays in the background it is a joint presupposition, something we both presuppose and both presuppose the other presupposes, etc.

Modeling common ground seems important for modeling and judging debates, because explicit statements may be in a strong position relative to the common ground but only weakly supported by other explicit statements. Moreover, support from the common ground is dynamic, evolving in an only partially rule-governed way during debate. David Lewis notes that there is something odd about “All Fred’s children are asleep, and Fred has children,” but nothing nearly so odd about “Fred has children, and all Fred’s children are asleep.” Why? The first sentence introduces the presupposition that Fred has children in its first clause, and so it is unnecessary to say so explicitly in the second clause. In the second sentence, the speaker claims that Fred has children explicitly while this claim still makes a difference. Lewis concludes that speakers in general (and so debaters as well) introduce new presuppositions to the common ground as needed. The introduction is successful if left unchallenged, that is, if no one says, “Wait–what?–Fred has children?” This rule is still fuzzy, though, because we do not know what happens when a new candidate presupposition conflicts with something already in the common ground, or when the candidate is objectionable for some other reason but escapes immediate challenge. To judge an ongoing debate, we must know the content of the common ground and track its evolution, all without the aid of well-developed rules.

Unfortunately, the current version of ArgRank only networks explicit statements and cannot account for implicit common ground effects directly. It is hard to see how one might do better, though. The best course may be to take advantage of how much faster computers are than humans and simply make all joint presuppositions explicit. Since presuppositions are by definition not explicit, this amounts to doing away with common ground and thereby further distancing DebateGPT from actual human debate. However, not all unspoken but mutual beliefs are equally relevant, and relevance may come in degrees with no clear cut-off demarcating the common ground. When we disagree about the character’s name, we tacitly agree that there are novels with named characters–but we may also tacitly agree that “раскол” and “rascal” are not etymologically connected despite appearances. Making everything explicit is inefficient, and perhaps impossible, if we cannot draw these borders.

Worse, making the common ground explicit might work against truth-seeking. This would happen when the accidental ends of debate (such as preserving a tranquil community, resisting or imposing change, having fun, etc.) exert control over the common ground. In that case, something may be presupposed even though neither I nor my opponent accepts it as true; supposition pulls away from belief, and joint supposition away from agreement. Olúfẹ́mi Táíwò illustrates this point persuasively with Hans Christian Andersen’s folktale about the emperor’s new clothes. The emperor in his vanity has been tricked into wearing ‘invisible clothes’ that are really no clothes at all. Everyone goes along with the fiction; no one explicitly disagrees. This creates a joint presupposition that the emperor has new clothes despite everyone knowing–and perhaps even knowing that everyone knows–that he is naked. Status and the desire to stay out of trouble allow the common ground to contain a known falsehood, and in general, there is no guarantee that debaters will fill the common ground with genuine, shared beliefs.

We do not know how to model common ground, and we do not know how to do away with common ground by making it explicit–and even if we could do these things, the common ground might detract from our overarching truth-seeking project.Though see also the improvements to ArgRank suggested in an earlier section. The only remaining solution is to have no common ground in the first place. Everything must be explicit from the beginning, and only what the debaters commit to explicitly may be used in ArgRank to evaluate debates and then train the next iteration of DebateGPT.

Unfortunately, this too is unfeasible at present, because the natural language inference models at the foundation of ArgRank are themselves trained on datasets that presume a common ground. For example, the nli-deberta-v3-xsmall model is trained on the SNLI dataset, which is composed of human-generated content. This dataset contains examples like the following, marked as “entailment”:

\[\text{Premise: “Girl in a red coat, blue head wrap and jeans is making a snow angel.”}\] \[\text{Hypothesis: “A girl outside plays in the snow.”}\]

This seems obviously right to us, but only because something like “snow angels are made outside” is admissible to our common ground. Relying on these models leads us to treat as primitive relations of support that are in fact mediated through a common ground. DebateGPT as configured today does employ a common ground, but one hardwired into a dependency and so impossible to reliably update or make explicit.
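For reference, querying such a frozen model directly looks roughly as follows. This sketch uses the sentence-transformers CrossEncoder wrapper around the checkpoint assumed here to be cross-encoder/nli-deberta-v3-xsmall, and the label order shown is an assumption that should be checked against the model card.

```python
from sentence_transformers import CrossEncoder

# The frozen dependency discussed above; its verdicts encode a hardwired
# common ground inherited from the SNLI annotators.
model = CrossEncoder("cross-encoder/nli-deberta-v3-xsmall")

premise = "Girl in a red coat, blue head wrap and jeans is making a snow angel."
hypothesis = "A girl outside plays in the snow."

scores = model.predict([(premise, hypothesis)])      # one row of raw label scores
labels = ["contradiction", "entailment", "neutral"]  # assumed label order
print(labels[scores[0].argmax()])                    # expected: "entailment"
```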

To summarize, human debates employ a common ground, a fuzzy-bordered and flexible set of joint presuppositions that may not match the debaters’ actual shared beliefs. DebateGPT must either model a sort of idealized common ground that somehow avoids the way human debate can veer away from truth-seeking, or else rely exclusively on explicit statements. The former option confronts unsolved philosophical and technical problems, while the latter would require a fundamental revision of ArgRank and its dependence on frozen natural language inference models.

Conclusion: Debate without Complications?

This section has explored three fundamental differences between machine and human debate. Human beings have unequal, uneven access to truth–and they take these inequities into account to learn from each other more efficiently. All this happens before debate even begins, and it shapes the way debate plays out. Once the institution of debate is established among humans, it becomes available for many purposes beyond its intrinsic end. Finally, any actual human debate is partially unspoken, employing a common ground of joint presuppositions that licenses some inferences while blocking others. The machine debate we envision, on the other hand, employs agents spun up for the debate without prior epistemic standing in a community, lacking any purposes of their own beyond achieving rational agreement, and independent of any common ground that cannot be challenged or made explicit within the debate itself.

The overarching goal is a machine that gives us certain knowledge when we ask for it. To achieve this, we need a machine that can research, reliably finding truths and sorting out falsehoods. For this, we operationalized truth as debate, defined as the iterative search for positions that coherently challenge their predecessors. Building a machine that debates in this idealized way is not easy, though, because the human examples available for supervised training are fraught with undesirable features, even at the most fundamental level of natural language inference.

Truth & Machines

Knowledge may be too much to ask. Assume we operationalize truth correctly as the iterative search for positions that coherently challenge their predecessors. Assume further that we implement this idealized form of debate in DebateGPT. Assume finally that we are free of contingent limits on time and compute. To gain knowledge we would still need to use the program properly. This section outlines three obstacles to effective use, and suggests that DebateGPT may need to accommodate its users in roughly the way that an excellent teacher accommodates her students.

The first step is to see the difficulty. With all these assumptions granted, DebateGPT would be an excellent oracle in the sense first described by Nick Bostrom in Superintelligence and later adopted by alignment researchers. An oracle answers questions truthfully, or at least accurately, and it would seem easy to gain knowledge from such a machine. We would simply need to ask and wait. However, our questions might be ill-formed, the answer to a well-formed question might exceed our comprehension, and the fact that the machine generates true answers is not sufficient to justify our trust. We risk asking questions that lack true answers, merely parroting true answers we do not understand, or believing new truths without justification. In each case, we would fail to know. To overcome these obstacles, DebateGPT may need to function as an excellent teacher: refining questions, explaining answers, and working in a transparent, trustworthy way.That said, see also the “non-oracular” applications of Chapter IV, which do not focus directly on helping humans know.

Before addressing each obstacle, we should attempt to define knowledge.We will not be the first to try: this is an original philosophical question and the primary focus of an active subfield, epistemology. Within epistemology, there is even a movement to avoid defining knowledge, taking it instead as a fundamental term. Narrowing the focus to knowledge-that as opposed to know-how,Even this distinction is controversial. See Jason Stanley. there is consensus that we know a proposition \(P\) only if (1) \(P\) is true, (2) we believe \(P\), and (3) the belief that \(P\) is somehow justified or appropriate in the circumstances.That belief must be somehow demonstrated or rooted appears already in Plato’s Meno. Notoriously, this is not enough for knowledge, because the justification for believing \(P\) can pull apart from the reason \(P\) is true in mischievous ways. For example, I may believe Paul is in Italy because he told me so. From this I infer by first-order logic that Paul is in Italy or Greece (i.e., by disjunction introduction). If Paul actually is in Greece, I have a justified, true belief–but only by sheer luck. More simply, my trusty clock has finally stopped but I happen to consult it exactly twelve hours later. Knowledge, then, is justified and true belief where the belief’s justification is properly related to the belief’s truth. There is no consensus yet on how to make this definition precise, let alone on how to operationalize it for AI.

Forming Good Questions

Asking questions without a true answer is the most straightforward way to misuse an oracle. Unfortunately, it is difficult to avoid this risk without confining ourselves to already very well-understood domains of inquiry. As the subsection on common ground implies, many questions we might put to DebateGPT carry presuppositions that may turn out to be false, thereby making it impossible to answer the question directly without accepting the false presupposition. For example, we might ask why the sun rises in the West, or why water contracts when it freezes. To mitigate the risk, we might ask more awkward but less presumptuous versions of the same questions, like “Why does the sun rise in the direction in which it rises?” or “Why does water change in density in the way it changes in density when it freezes?” Of course, these questions still carry presuppositions. Trying again, we might ask, “Why does the sun move relative to the Earth the way it moves relative to the Earth?” Alternatively, we might achieve the same goal by ostension: [uttered while pointing at the sun] “Why does the sun do that?” Shaving away all presuppositions to completely eliminate this risk, we may be left typing “Why?” while gesturing wildly in all directions.

Other questions, perhaps most questions, rest on vaguely specified or context-dependent concepts. This is a subtler but equally common way in which questions can fail to have a true answer. How should the machine answer when we ask for the height difference between the tallest short man and the shortest tall man? The question fails to have a unique, true answer because “tall” and “short” are not precisely defined, even in the narrow context of male stature. Is the water in Lake Baikal frigid in the summer? Who was the first human being? Most questions, even those posed in a scientific context, involve at least one vague or ambiguous term. Human beings handle these questions honestly by settling for accuracy as a proxy for strict truth, and this is what Armstrong, Sandberg, and Bostrom propose for artificial intelligences as well: “Informative accuracy is the requirement, not the strict truth.”

Another, compatible solution is to answer the question in the course of a conversation about the question itself. If you ask why water contracts as it freezes, I would correct your mistake but also ask why you think it does. As the conversation unfolds, I would aim to trace the mistake back to its root and rebuild from there. This strategy preserves the goal of truth, but it would require DebateGPT to engage with human users like a debater, exposing false assumptions and demanding a clarification of terms. If most questions carry risky presuppositions and underdefined terms, then effectively transferring knowledge to human users would require the program to use its capacity for debate to teach.
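As a rough illustration of this strategy, the answering step can be wrapped in a clarification pass that surfaces a question’s presuppositions and has the user confirm or repair them before any answer is produced. The sketch below is ours rather than DebateGPT’s; `ask_model`, `user_confirms`, and the prompt phrasings are stand-ins for whatever interface and wording an actual system would use.

```python
from typing import Callable, List

def surface_presuppositions(question: str,
                            ask_model: Callable[[str], str]) -> List[str]:
    """Ask the model to list what the question takes for granted."""
    reply = ask_model(
        f"List, one per line, the factual presuppositions of: {question!r}"
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def answer_after_clarification(question: str,
                               ask_model: Callable[[str], str],
                               user_confirms: Callable[[str], bool]) -> str:
    """Answer only once every presupposition is confirmed or repaired."""
    for claim in surface_presuppositions(question, ask_model):
        if not user_confirms(claim):
            # Trace the mistake back to its root rather than answering directly.
            question = ask_model(
                f"Rephrase {question!r} so that it no longer presupposes {claim!r}."
            )
    return ask_model(question)

if __name__ == "__main__":
    # Stub model and user, purely for illustration.
    demo_model = lambda prompt: f"[model reply to: {prompt}]"
    demo_user = lambda claim: False  # the user rejects every surfaced presupposition
    print(answer_after_clarification(
        "Why does water contract when it freezes?", demo_model, demo_user))
```

A real system would iterate this exchange, feeding the user’s corrections back into further challenges rather than stopping after a single pass.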

Understanding the Answers

Even when posing a good question, we might fail to understand the answer, a risk that applies especially to the most interesting or far-reaching questions. The danger here is not just that the answer may be too complex, too far down some dialectical rabbit-hole for us to follow, but that the answer will involve concepts that are incommensurable with our own current conceptual resources. This has happened before in the development of human thought, and the traces of profound conceptual shifts remain in our language, as when we speak of splitting the atom (literally, that which cannot be split). Transferring knowledge to its human users may once again require DebateGPT to act as a teacher, though the sort of teaching needed to explain a difficult answer is less obviously associated with debate than the sort required to refine a question. Rather than debate with its human users, the program would need to present its own debating history as a process of discovery, because the series of coherent challenges that led to a difficult answer may contain the resources needed to understand the answer itself.This is roughly Descartes’s strategy in the Meditations, repackaging his own process of discovery as a set of exercises for the reader.

Trusting the Machine

Asking a good question and understanding its answer does not take us all the way to knowledge, because we still need some reason to trust that the program has answered correctly. It is one thing for the machine to be trustworthy, and another for us to know that it is trustworthy. What could entitle us to this belief?

We could check its answers against knowledge we already have and infer that the machine always answers truthfully, but this gives us limited assurance that the machine answers truthfully, rather than truthfully-only-when-it-can-be-checked-against-prior-human-knowledge. Our confidence in the machine would be limited by the scope of our prior knowledge, thereby compromising its ability to give us any new knowledge. Alternatively, we could know that the machine’s design guarantees that it only outputs truths, but this cuts against the character of machine learning, in which machines regularly learn methods or facts that their designers do not understand. Besides, if we did possess such a design or algorithm, then we would already have the sort of original intimacy with the world that the machine is supposed to supply.

These two ways of assuring ourselves that “whatever the machine tells us is true” are comparable to the ways we assure ourselves that a chess-playing AI has made a good move. We could check the move against the massive corpus of chess games. Alternatively, we could code our recipe for “good chess moves” into the AI, up to the limit of solving the game mathematically. The first method cannot justify trust in moves outside the corpus or beyond our comprehension, while the second presumes we already have a deep enough understanding of chess prior to building the AI. In searching for good chess moves as in searching for truth, we cannot adequately justify our trust in the program unless we already know everything (and so don’t need the machine at all) or already understand the algorithm that will teach us everything (and only need the machine for its speed).
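The first limitation can be made concrete with a toy audit harness that scores the oracle only on questions whose answers we already possess. The function name and data below are illustrative choices of our own; the point is that a perfect score here supports “truthful when checkable”, not “truthful”.

```python
from typing import Callable, Dict

def checkable_accuracy(oracle: Callable[[str], str],
                       known_answers: Dict[str, str]) -> float:
    """Fraction of already-known questions the oracle answers correctly.

    A perfect score only licenses trust on questions of this kind; the
    questions the oracle was actually built for lie, by design, outside
    this set.
    """
    if not known_answers:
        return 0.0
    correct = sum(oracle(question).strip().lower() == answer.strip().lower()
                  for question, answer in known_answers.items())
    return correct / len(known_answers)

if __name__ == "__main__":
    # Toy prior knowledge; a real audit would use a large held-out set.
    prior = {"What is the boiling point of water at sea level, in Celsius?": "100"}
    print(checkable_accuracy(lambda question: "100", prior))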

This situation may not be as bad as it seems, however, if the algorithm is extremely simple. If there is an interpretable, elegant way to implement coherent challenging, then we could understand the algorithm and thereby trust the machine while still marveling at what it is able to discover by implementing that algorithm at superhuman speed. As for original intimacy with the world, we have just enough to know what it is like to learn, but not enough to learn everything we would like to know, and for this we need something like DebateGPT.
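For what it is worth, the outer loop of such an “extremely simple” algorithm might look like the sketch below: propose an answer, challenge it, and keep it only as long as it survives. Every function passed in is a placeholder of our own; nothing here is the actual ArgRank or DebateGPT machinery, only an illustration of how small the auditable part could in principle be.

```python
from typing import Callable

def deliberate(question: str,
               propose: Callable[[str], str],
               challenge: Callable[[str, str], str],
               survives: Callable[[str, str], bool],
               rounds: int = 3) -> str:
    """Keep an answer only for as long as it withstands coherent challenges."""
    answer = propose(question)
    for _ in range(rounds):
        objection = challenge(question, answer)
        if not survives(answer, objection):
            # Revise in light of the objection rather than discarding outright.
            answer = propose(f"{question}\n(Previous answer failed against: {objection})")
    return answer

if __name__ == "__main__":
    # Stubs standing in for model calls, purely for illustration.
    print(deliberate("Why is the sky blue?",
                     propose=lambda q: "Because of Rayleigh scattering.",
                     challenge=lambda q, a: "Why blue rather than violet?",
                     survives=lambda a, o: True))
```

The interpretability claim in the text concerns this outer loop, not the model calls inside it, which remain opaque.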

We began this chapter hoping to revive the dream of certain knowledge by passing the goal along to a machine. For this we defended an operationalization of truth grounded in the PSR and a refined version of debate that differs in important ways from actual human debates. We may now add that bringing the knowledge gained back into human minds would require DebateGPT to act as a teacher, clarifying our questions and explaining its answers, all while running an interpretable, indeed familiar, algorithm. At each stage, we noted gaps between the goal and what we can fully justify or achieve, even setting aside all contingent limits. The gaps are not reasons to give up, but markers to guide and structure progress, so that we can learn as much as possible from iterating on, and then from using, such deliberative machines.

Table. Timeline.

Mar-Apr 2021: Paul conducts early experiments involving fine-tuning language models as personal simulators, employing them in dialogue, prompting them procedurally, and hooking them up to external tools (originally elsewhere).
Jan-Jul 2022: Paul conducts several early experiments involving language models, reinforcement learning, and natural language inference.
Nov 2022: Following a fellowship at Conjecture, Paul starts working on ArgRank, DebateGPT, and bounded defensibility, and secures a grant.
Mar 2023: Elfia, Tom, and Yimeng join to investigate assumptions and benchmark artifacts at AI Safety Camp Virtual 2023 (see Chapter V and Chapter VI).
Mar 2025: Light editing sweep across the content for improved readability.