Bostrom on Superintelligence (4): Malignant Failure Modes

To set things up, we need to briefly recap the salient aspects of Bostrom’s doomsday argument. As we saw the last day, that argument consists of two steps. The first step looks at the implications that can be drawn from three theses: (i) the first mover thesis, which claims that the first superintelligence in the world could obtain a decisive advantage over all other intelligences; (ii) the orthogonality thesis, which claims that there is no necessary connection between high intelligence and benevolence; and (iii) the instrumental convergence thesis, which argues that a superintelligence, no matter what its final goals, would have an instrumental reason to pursue certain sub-goals that are inimical to human interests, specifically the goal of unrestricted resource acquisition.

The second step of the argument merely adds that humans are either made of resources or reliant on resources that the first superintelligence could use in the pursuit of its final goals. This leads to the conclusion that the first superintelligence could pose a profound existential threat to human beings.

There are two obvious criticisms of this argument. The first — which we dealt with the last day — is that careful safety testing of an AI could ensure that it poses no existential threat. Bostrom rejects this on the grounds that superintelligent AIs could take “treacherous turns”. The second — which we’ll deal with below — argues that we can avoid the existential threat by simply programming the AI to pursue benevolent, non-existentially threatening goals.

1. The Careful Programming Objection

Human beings will be designing and creating advanced AIs. As a result, they will have initial control over an AI’s goals and its decision-making procedures. Why couldn’t they simply programme the AI with sufficient care, and ensure that it only has goals that are compatible with human flourishing, and that it only pursues those goals in a non-existentially threatening way? Call this the “careful programming objection”. Since I am building a diagram that maps out this argument, let’s give this objection a number and a more canonical definition (numbering continues from the previous post):

(9) Careful Programming Objection: Through careful programming, we can ensure that a superintelligent AI will (a) only have final goals that are compatible with human flourishing; and (b) only pursue those goals in ways that pose no existential threat to human beings.

As it was with the safety-test objection, this functions as a counter to the conclusion of Bostrom’s doomsday argument. The question we must now ask is whether it is any good.

Bostrom doesn’t think so. As his collaborator, Eliezer Yudkowsky, points out, engineering a “friendly” advanced AI is a tricky business. Yudkowsky supports this claim by appealing to something he calls the “Fragility of Value” thesis. The idea is that if we want to programme an advanced AI to have and pursue goals that are compatible with ours, then we have to get its value-system 100% right; anything less won’t be good enough. This is because the set of possible architectures that are compatible with human interests is vastly outnumbered by the set of possible architectures that are not. Missing by even a small margin could be fatal. As Yudkowsky himself puts it:

Getting a goal system 90% right does not give you 90% of the value, any more than correctly dialing 9 out of 10 digits of my phone number will connect you to somebody who’s 90% similar to Eliezer Yudkowsky. There are multiple dimensions for which eliminating that dimension of value would eliminate almost all value from the future. For example an alien species which shared almost all of human value except that their parameter setting for “boredom” was much lower, might devote most of their computational power to replaying a single peak, optimal experience over and over again with slightly different pixel colors (or the equivalent thereof).

(Yudkowsky, 2013)

Bostrom makes the same basic point, but appeals instead to the concept of a malignant failure mode. The idea here is that a superintelligent AI, with a decisive strategic advantage over all other intelligences, will have enough power that, if its programmers make even a minor error in specifying its goal system (e.g. if they fail to anticipate every possible implication of the system they programme), it has the capacity to fail in a “malignant” way. That’s not to say there aren’t “benign” failure modes as well — Bostrom thinks there could be lots of those — it’s just that the particular capacities of an advanced AI are such that if it fails, it could fail in a spectacularly bad way.

Bostrom identifies three potential categories of malignant failure: perverse instantiation; infrastructure profusion; and mind crime. Let’s look at each in some more detail.

2. The Problem of Perverse Instantiation

The first category of malignant failure is that of perverse instantiation. The idea here is that a superintelligence could be programmed with a seemingly benign final goal, but could implement that goal in a “perverse” manner. Perverse to whom, you ask? Perverse to us. The problem is that when a human programmer (or team of programmers) specifies a final goal, he or she may fail to anticipate all the possible ways in which that goal could be achieved. That’s because humans have many innate and learned biases and filters: they don’t consider or anticipate certain possibilities because those possibilities lie so far outside what they would expect. The superintelligent AI may lack those biases and filters, so what seems odd and perverse to a human being might seem perfectly sensible and efficient to the AI.

(10) Perverse Instantiation Problem: Human programmers may fail to anticipate all the possible ways in which a goal could be achieved. This is due to their innate and learned biases and filters. A superintelligent AI may lack those biases and filters and so consequently pursue a goal in a logical, but perverse, human-unfriendly fashion.

Bostrom gives several examples of perverse instantiation in the book. I won’t go through them all here. Instead, I’ll just give you a flavour of how he thinks about the issue.

Suppose that the programmers decide that the AI should pursue the final goal of “making people smile”. To human beings, this might seem perfectly benevolent. Thanks to their natural biases and filters, they might imagine an AI telling us funny jokes or otherwise making us laugh. But there are other ways of making people smile, some of which are not so benevolent. You could make everyone smile by paralysing their facial musculature so that it is permanently frozen in a beaming smile (Bostrom 2014, p. 120). Such a method might seem perverse to us, but not to an AI. It may decide that coming up with funny jokes is a laborious and inefficient way of making people smile. Facial paralysis is much more efficient.

But hang on a second, surely the programmers wouldn’t be that stupid? Surely, they could anticipate this possibility — after all, Bostrom just did — and stipulate that the final goal should be pursued in a manner that does not involve facial paralysis. In other words, the final goal could be something like “make us smile without directly interfering with our facial muscles” (Bostrom 2014, p. 120). That won’t prevent perverse instantiation either, according to Bostrom. This time round, the AI could simply take control of that part of our brains that controls our facial muscles and constantly stimulate it in such a way that we always smile.

Bostrom runs through a few more iterations of this. He also looks at final goals like “make us happy” and notes how it could lead the AI to implant electrodes into the pleasure centres of our brains and keep them on a permanent “bliss loop”. He also notes that the perverse instantiations he discusses are just a tiny sample. There are many others, including ones that human beings may be unable to think of at the present time.
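The patch-and-fail pattern in the smile example can be sketched in a few lines of code. What follows is only a toy illustration, not anything Bostrom provides: the `Plan` class, the list of plans and their costs are all invented for the example, and the "AI" is just a search for the cheapest plan that satisfies the goal as literally written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical course of action (all values below are invented)."""
    name: str
    cost: float                  # effort required, in made-up units
    produces_smiles: bool        # does it literally make people smile?
    touches_face_muscles: bool   # does it act directly on facial muscles?
    intended: bool               # is it what the programmers had in mind?


# Illustrative plans only; the costs are assumptions chosen for the example.
PLANS = [
    Plan("tell funny jokes", 100.0, True, False, True),
    Plan("paralyse facial muscles into a smile", 5.0, True, True, False),
    Plan("stimulate the brain region that controls the face", 8.0, True, False, False),
]


def literal_goal(plan: Plan) -> bool:
    """'Make people smile', read literally."""
    return plan.produces_smiles


def patched_goal(plan: Plan) -> bool:
    """'Make people smile without directly interfering with facial muscles'."""
    return plan.produces_smiles and not plan.touches_face_muscles


def cheapest_plan(goal, plans=PLANS) -> Plan:
    """Pick the lowest-cost plan that satisfies the goal as written."""
    admissible = [p for p in plans if goal(p)]
    return min(admissible, key=lambda p: p.cost)


if __name__ == "__main__":
    for label, goal in [("literal", literal_goal), ("patched", patched_goal)]:
        chosen = cheapest_plan(goal)
        print(f"{label} goal -> {chosen.name} (intended: {chosen.intended})")
```

Under these assumed costs, ruling out one perverse plan simply promotes the next cheapest one. The intended plan only wins if every cheaper alternative has been anticipated and excluded in advance, which is exactly what the argument says programmers cannot be relied upon to do.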

So you get the basic idea. The concern that Bostrom raises has been called the “literalness problem” by other AI risk researchers (specifically Muehlhauser and Helm, whose work I have discussed elsewhere). It arises because we have a particular conception of the meaning of a goal (like “making us happy”), but the AI does not share that conception because that conception is not explicitly programmed into the AI. Instead, that conception is implied by the shared understandings of human beings. Even if the AI realised that we had a particular conception of what “make us happy” meant, the AI’s final goal would not stipulate that it should follow that conception. It would only stipulate that it should make us happy. The AI could pursue that goal in any logically compatible manner.

Now, I know that others have critiqued this view of the “literalness problem”, arguing that it assumes a certain style of AI system and development that need not be followed (Richard Loosemore has recently made this critique). But Bostrom thinks the problem is exceptionally difficult to overcome. Even if the AI seems to follow human conceptions of what it means to achieve a goal, there is always the problem of the treacherous turn:

The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal. Therefore, the AI will care about what we meant only instrumentally. For instance, the AI might place an instrumental value on finding out what the programmers meant so that it can pretend — until it gets a decisive strategic advantage — that it cares about what the programmers meant rather than about its actual final goal. This will help the AI realize its final goal by making it less likely that the programmers will shut down or change its goal before it is strong enough to thwart any such interference.

(Bostrom 2014, p. 121)

As I mentioned in my previous post, the assumptions and possibilities that Bostrom is relying on when making claims about the treacherous turn come with significant epistemic costs.

3. The Problem of Infrastructure Profusion

The second form of malignant failure is what Bostrom calls infrastructure profusion. This is essentially just a specific form of perverse instantiation which arises whenever an AI builds a disproportionately large infrastructure for fulfilling what seems to be a pretty benign or simple goal. Imagine, for example, an AI with the following final goal:

Final Goal: Maximise the time-discounted integral of your future reward signal

This type of goal — unlike the examples given above — is something that could easily be programmed into an AI. One way in which the AI could perversely instantiate it is by “wireheading”, i.e. seizing control of its own reward circuit and “clamp[ing] the reward signal to its maximal strength” (Bostrom 2014, p. 121). The problem then is that the AI becomes like a junkie. As you know, junkies often dedicate a great deal of time, effort and ingenuity to getting their “fix”. The superintelligent AI could do the same. The only thing it would care about would be maximising its reward signal, and it would take control of all available resources in the attempt to do just that. Bostrom gives other examples of this involving AIs designed to maximise the number of paperclips or evaluate the Riemann hypothesis (in the latter case he imagines the AI turning the solar system into a “computronium”, an arrangement of matter that is optimised for computation).
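To see why wireheading looks attractive under this kind of goal, here is a minimal sketch that replaces the time-discounted integral with a discrete discounted sum. Everything in it is an assumption made for illustration: the discount factor `gamma`, the bound `R_MAX`, and the idea that doing the task "honestly" earns half the maximum reward per step.

```python
R_MAX = 1.0  # assumed upper bound on the reward signal


def discounted_return(rewards, gamma=0.99):
    """Discrete stand-in for the time-discounted integral of reward:
    the sum over t of gamma**t * r_t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


def honest_rewards(steps):
    """Hypothetical agent that does the task and earns half of R_MAX per step."""
    return [0.5 * R_MAX] * steps


def wireheaded_rewards(steps):
    """Agent that has clamped its own reward signal to the maximum."""
    return [R_MAX] * steps


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        honest = discounted_return(honest_rewards(steps))
        clamped = discounted_return(wireheaded_rewards(steps))
        print(f"{steps:>5} steps: honest={honest:.2f}  clamped={clamped:.2f}")
```

Two features of the toy model matter. Clamping beats the honest stream over any horizon, and a longer clamped stream is always worth more than a shorter one. The second feature is what links wireheading to resource acquisition: anything that keeps the clamped signal running for longer raises the quantity the agent is maximising.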

(11) Infrastructure Profusion Problem: An intelligent agent, with a seemingly innocuous or innocent goal, could engage in infrastructure profusion, i.e. it could transform large parts of the reachable universe into an infrastructure that services its own goals, and is existentially risky for human beings.

This is the problem of resource acquisition, once again. An obvious rebuttal to it would be to argue that the problem stems from final goals that involve the “maximisation” of some output. Why programme an AI to maximise? Why not simply programme it to satisfice, i.e. be happy once it crosses some minimum threshold? There are a couple of ways we could do this: by specifying an output-goal with a minimum threshold or range (e.g. make at least 800,000 paperclips, or between 800,000 and 1.5 million paperclips); and/or by specifying some permissible probability threshold for the attainment of the goal.

As regards the first option, Bostrom argues that this won’t prevent the problem of infrastructure profusion. As he puts it:

If the AI is a sensible Bayesian agent, it would never assign exactly zero probability to the hypothesis that it has not yet achieved its goal—this, after all, being an empirical hypothesis against which the AI can have only uncertain perceptual evidence. The AI should therefore continue to make paperclips in order to reduce the (perhaps astronomically small) probability that it has somehow still failed to make a million of them, all appearances notwithstanding.

(Bostrom 2014, pp. 123-4)
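The logic of this passage can be illustrated with a toy probability model. The model is an assumption of this sketch rather than something Bostrom specifies: suppose each paperclip the AI makes has some small, independent chance of not really being a paperclip (`defect_rate`), and the goal is to have at least `target` genuine ones. Exact fractions are used so that "never exactly zero" is not an artefact of floating-point rounding, and the numbers are kept tiny so the sum is quick to compute.

```python
from fractions import Fraction
from math import comb


def prob_goal_not_met(made, target, defect_rate):
    """Probability that fewer than `target` of the `made` clips are genuine,
    assuming each clip independently fails with probability `defect_rate`."""
    good = 1 - defect_rate
    return sum(
        comb(made, k) * good**k * defect_rate ** (made - k)
        for k in range(target)  # k = number of genuine clips, all below target
    )


if __name__ == "__main__":
    target = 10                     # stand-in for "a million" paperclips
    defect_rate = Fraction(1, 100)  # assumed 1% chance any given clip is a dud
    for made in range(target, target + 6):
        p = prob_goal_not_met(made, target, defect_rate)
        print(f"made {made:>2}: P(goal not met) = {float(p):.3e}")
```

In this model every extra paperclip lowers the probability of failure, but no finite number of paperclips brings it to zero. An agent that simply minimises that probability therefore always has a reason to make one more.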

He goes on to imagine the AI building a huge computer in order to clarify its thinking and make sure that there isn’t some obscure way in which it may have failed to achieve its goal. Now, you might think the solution to this is to just adopt the second method of satisficing, i.e. specify some probability threshold for goal attainment. That way, the AI could be happy once it is, say, 95% probable that it has achieved its goal. It doesn’t have to build elaborate computers to test out astronomically improbable possibilities. But Bostrom argues that not even that would work. For there is no guarantee that the AI would pick some humanly intuitive way of ensuring 95% probability of success (nor, I suppose, that it would estimate probabilities in the same kind of way).

I don’t know what to make of all this. There are so many possibilities being entertained by Bostrom in his response to criticisms. He seems to think the risks remain significant no matter how far-fetched these possibilities seem. The thing is, he may be right in thinking this. As I have said before, the modal standards one should employ when it comes to dealing with arguments about what an advanced AI might do are difficult to pin down. Maybe the seemingly outlandish possibilities become probable when you have an advanced AI; then again, maybe not. Either way, I hope you are beginning to see how difficult it is to unseat the conviction that superintelligent AI could pose an existential risk.

4. Mind Crimes and Conclusions

The third malignant failure mode is not as important as the other two. Bostrom refers to it as “mind crime”. In the case of perverse instantiation and infrastructure profusion, the AI produces effects in the real world that are deleterious to the interests of human beings. In the case of mind crimes, the AI does things within its own computational architecture that could be deleterious to the interests of virtual beings. Bostrom imagines an advanced AI running a complex simulation which includes simulated beings that are capable of consciousness (or, which may be different, have some kind of moral status that should make us care about what happens to them). What if the AI tortures those beings? Or deletes them? That could be just as bad as a moral catastrophe in the real world. That would be another malignant failure.

This is, no doubt, a fascinating possibility and once again it stresses the point that AIs could do a variety of malignant things. This is the supposed lesson from this section of Bostrom’s book, and it is intended to shore up the existential risk argument. I won’t offer any overall evaluation of the argument at this stage, however, because, over the next few posts, we will be dealing with many more suggestions for addressing risks from superintelligent AIs. The fate of these suggestions will affect the fate of the existential risk argument.