The Three Goals, Game Theory, and Western Civilization
Phil Bowermaster
2007-06-11 00:00:00

This kicked off some interesting discussion, including some debate as to whether my "goals" aren't really just rules rephrased. In which case, the argument went, they probably wouldn't help all that much. Michael Anissimov commented:
What would work better would be transferring over the moral complexity that you used to make up these goals in the first place.

Also, as you point out, these goals are vague. More specific and useful from a programmer's perspective would be some kind of algorithm that takes human preferences as inputs and outputs actions that practically everyone sees as reasonable and benevolent. Hard to do, obviously, but CEV (http://www.singinst.org/upload/CEV.html) is one attempt.
That's really the crux. Moral complexity does exist in algorithmic form...within our brains. And that goes to the difference between laws and goals. My goals are what I'm trying to do, both morally and in other areas. There are some sophisticated software programs running in my brain made up of things that I've been taught, things I've figured out for myself, and things that are built in. All of these add up to provide me the tendency to act a certain way in a certain situation. The strategies that drive that software are my moral goals.

Laws, on the other hand, exist outside of myself. I am not specifically programmed to do unto others as I would have them do unto me. I have some tendencies in that direction, but there's nothing stopping me from acting otherwise, and -- let's face it -- I often do. I have tendencies to be nice, fair, just, etc., but I also have tendencies to try to get what I want, to get even with those who have wronged me, to try to be a bigshot, and so on. These tendencies compete with each other, and my behavior overall is some rough compromise.

An artificial general intelligence (AGI) built as a reverse-engineered human intelligence would be in the same position. It would have the "moral complexity" Michael mentioned, but also the baggage of competing tendencies. You could no more guarantee such an intelligence's compliance with a rule or set of rules than you could a human being's.

A law like the Golden Rule is a high-level abstraction of certain strategies (algorithms) that produce a desired set of results. On a conscious level, I can use that abstraction to determine whether my behavior is where I want it to be:
Wife complained of being chilly when I got up at 5:00 AM to work out. Covered her with blanket. Good.

Sped up on highway in attempt to keep a guy trying to merge from going ahead of me. Not so good.

Commenter on blog revealed that he doesn't really understand the subject at hand. Ripped him to shreds. Bad.
Through discipline and practice, I can "program myself" with it to try to move my tendencies in that direction. But I can't write it into my moral source code and set it as an unbreakable behavioral rule. That's partly because it's too vague and partly because I simply lack that capability.

Presumably, I could be externally constrained always to follow the Golden Rule, no matter what. If my actions were being constantly monitored, and I was told that I would be killed immediately upon violating the rule...I'd certainly do my best, now wouldn't I?

Still, I'd have a hard time believing that anyone holding me in such a position was much of a practitioner of that rule him or herself. If the people trying to enforce the rule on me in this manner told me that it was for my own good -- that they were trying to make me a better person -- I don't know that I'd buy it. And if I figured out that they were only doing this to protect themselves from harm I might do to them, I think I would be pretty annoyed with them (to say the least).

I would expect a reverse-engineered human intelligence to feel the same way, so I don't think attempting to constrain an AGI in such a manner would be a particularly good idea, especially not if we have a reasonable expectation that it will eventually be smarter and more powerful than us. On the other hand, if we let it use the process I described above -- evaluating its own behavior against a defined standard -- an AGI might achieve far better results than I have, if only because it can think faster and would have much more subjective time in which to act. This is the notion of recursive self-improvement that matoko kusanagi referred to. The trouble with recursive self-improvement on its own, as Eliezer Yudkowsky and others have pointed out, is that if the AI starts "improving" in a direction that's bad for humanity, things could get out of hand pretty quickly.
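To make that notion slightly more concrete, here's a toy sketch in Python of the propose-evaluate-adjust cycle. Everything in it is invented for illustration: the "standard" is a made-up scoring function, and real recursive self-improvement would mean an AGI rewriting its own machinery, not nudging a single number.

```python
import random

# A toy "evaluate against a standard, then adjust" loop. The standard here
# is an arbitrary numeric score; nothing about a real moral standard (the
# Golden Rule or anything else) actually reduces to a single number.

def standard(behavior):
    """Hypothetical scoring function: how close is this behavior to an ideal?"""
    ideal = 0.8
    return -abs(behavior - ideal)

def self_evaluate_and_improve(steps=1000, step_size=0.05):
    behavior = random.random()                        # starting tendency
    for _ in range(steps):
        candidate = behavior + random.gauss(0, step_size)
        if standard(candidate) > standard(behavior):  # did the change score better?
            behavior = candidate                      # keep the improvement
    return behavior

print(round(self_evaluate_and_improve(), 2))          # drifts toward the "ideal" of 0.8
```

The worry about recursive self-improvement amounts to this: if the scoring function is even slightly wrong, a loop like this will optimize, very efficiently, toward the wrong thing.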

If the artificial intelligence is a modified version of human intelligence, or a new intelligence built from scratch, we raise the possibility of building a moral structure into the intelligence, rather than trying to enforce it from outside. That's the idea behind the Three Laws and my Three Goals -- that they would somehow be built in. But they certainly can't be built in, in anything like their current form. Michael Sargent (and others) pointed out a weakness of that approach, namely that the less important goals have to take a back seat to the more important ones:
Each Goal must have a clear and unbreakable priority over the others that follow it and thus, in the order stated, collective continuity trumps individual safety ("The needs of the many outweigh the needs of the few, or the one."), individual safety (broadly construed, 'stasis') trumps individual liberty ('free will'), and happiness ('utility', a notoriously slippery concept for economists and philosophers to get a firm intellectual grip on) trumps both individual liberty and individual well-being (allowing potentially self-destructive behavior on the individual level insofar as that behavior doesn't exceed the standard established for 'safety' in Goal 2).
I see the reasoning here, but I'm not 100% convinced. Consider the goals that drive a much simpler AI system -- the autopilot found on any jet airliner. The number one unbreakable goal has got to be: don't crash the plane. But there are many other goals that might drive such a system:
Don't move in such a way as to make the passengers sick.

Don't waste fuel.

In landing, don't go past the end of the runway.
Above all, the system will seek to ensure that first goal. But within the context of ensuring that first goal, it also has to do everything it can to ensure the others. And, yes, it can and must sacrifice the others from time to time in service of the first. So the plane might temporarily move in a nauseating way, or it might waste fuel, or it might even slide past the end of the runway if doing any of those things helps ensure the first goal.
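One way to picture that kind of ordering is as a lexicographic comparison: candidate actions are ranked on the first goal, and the subordinate goals only break ties. The sketch below is purely illustrative -- the goal functions, maneuvers, and numbers are all invented, and no real autopilot works remotely like this -- but it shows how the lower goals can matter without ever overriding the first.

```python
# Illustrative only: rank candidate maneuvers by a strict priority of goals.
# The goal functions, maneuvers, and numbers are invented for the example.

def dont_crash(action):          # Goal 1: overriding
    return action["crash_risk"] == 0.0

def passenger_comfort(action):   # Goal 2: don't make the passengers sick
    return -action["turbulence"]

def fuel_economy(action):        # Goal 3: don't waste fuel
    return -action["fuel_burn"]

def choose_maneuver(candidates):
    # Python compares tuples lexicographically, so Goal 1 dominates,
    # Goal 2 breaks ties, and Goal 3 breaks any remaining ties.
    return max(candidates, key=lambda a: (dont_crash(a),
                                          passenger_comfort(a),
                                          fuel_economy(a)))

candidates = [
    {"name": "steep dive",  "crash_risk": 0.0, "turbulence": 9, "fuel_burn": 2},
    {"name": "gentle turn", "crash_risk": 0.0, "turbulence": 1, "fuel_burn": 5},
    {"name": "hold course", "crash_risk": 0.3, "turbulence": 0, "fuel_burn": 1},
]

print(choose_maneuver(candidates)["name"])   # -> "gentle turn"
```

Calibration, in this picture, is a matter of deciding how the subordinate goals score and trade off against one another once the top goal is satisfied.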

Reader TJIC suggested that an AI programmed to meet the Three Goals as I defined them...
1. Ensure the survival of life and intelligence.

2. Ensure the safety of individual sentient beings.

3. Maximize the happiness, freedom, and well-being of individual sentient beings.
...would end up creating a nanny state wherein human freedom is always sacrificed to individual safety. And he may well have a point, but I would argue that just as an autopilot can be calibrated to strike whatever balance we deem appropriate between not crashing the plane and not making the passengers sick, so could these three goals be calibrated so as to maximize human freedom within an acceptable level of individual risk -- whatever that might be.

Getting back to the vagueness problem, it's hard to calibrate the goals as stated, seeing as they are written in an awkward pseudo-code that we call human language. If we want to improve on the algorithms that are built into human intelligence, or develop entirely new ones -- in other words, if we're going to come up with algorithms that will provide us the ends stated in the goals -- we're going to have to do it mathematically.
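As a very rough illustration of what "calibration" might mean once the goals are stated mathematically, here's a sketch in which Goal 2 (safety) becomes a constraint with an adjustable risk tolerance and Goal 3 (happiness, freedom, well-being) becomes the objective maximized within that constraint. Every policy, number, and function here is invented; the point is only that the trade-off TJIC worried about shows up as a tunable parameter rather than a foregone conclusion.

```python
# Invented example: choose the policy that maximizes happiness, freedom, and
# well-being (Goal 3), subject to keeping individual risk below a tunable
# threshold (Goal 2). All names and numbers are made up for illustration.

RISK_TOLERANCE = 0.02   # the calibration knob: acceptable individual risk

policies = [
    {"name": "lock everyone indoors", "risk": 0.001,
     "happiness": 0.2, "freedom": 0.1, "wellbeing": 0.4},
    {"name": "light-touch oversight", "risk": 0.015,
     "happiness": 0.7, "freedom": 0.8, "wellbeing": 0.7},
    {"name": "no safeguards at all",  "risk": 0.200,
     "happiness": 0.6, "freedom": 1.0, "wellbeing": 0.5},
]

def acceptable(policy):
    # Goal 2 as a hard constraint with an adjustable threshold.
    return policy["risk"] <= RISK_TOLERANCE

def utility(policy):
    # Goal 3 as the objective to maximize within that constraint.
    return policy["happiness"] + policy["freedom"] + policy["wellbeing"]

best = max(filter(acceptable, policies), key=utility)
print(best["name"])   # -> "light-touch oversight"
```

Turn RISK_TOLERANCE down far enough and only the most restrictive policies survive the constraint; set it more generously and the freedom-maximizing policies win out.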

But that isn't necessarily going to be an easy thing to do. Eliezer Yudkowsky argues that developing an AI and setting it to work on doing some good thing are relatively easy compared to the third crucial step, making sure that that friendly, well-intentioned AI doesn't accidentally wipe us out of existence while trying to achieve those good ends:
If you find a genie bottle that gives you three wishes, it's probably a good idea to seal the genie bottle in a locked safety box under your bed, unless the genie pays attention to your volition, not just your decision.
Again, I think this goes to the issue of calibration of the system. Eliezer wants to calibrate what the AGI does with the coherent extrapolated volition of humanity. Volition is an extremely important concept. Earlier, I mentioned the Golden Rule. If I decide that I'm going to do unto others as I would have them do unto me, I might start handing out big wedges of blueberry pie to everybody I see. After all, I like pie and I would love it if people gave me pie. But if I give my diabetic or overweight or blueberry-allergic friends a wedge of that pie, I wouldn't be doing them any favors. Nor would I be doing what I wanted to do in the deepest sense.

Eliezer describes the concept of extrapolated volition as meaning not just what we want, but what we would want if we knew more, understood better, could see farther. Coming up with a coherent extrapolated volition for all of humanity is a tall order, especially if we're doing it not just for the sake of conversation, but in order to enable a system which will try to realize that which is within our volition.

I like to think that humanity's CEV would look a lot like the three goals that I've written. And I honestly believe that the algorithms that power human progress do work, in a rough and general way, towards those goals, which is why people are generally freer, safer, and happier than they have been in the past -- though obviously not without many, many appalling and horrific exceptions. So perhaps our calibration efforts involve feeding the AGI algorithms that will enable it to speed our progress towards those goals while cutting the exceptions way down. Or eliminating them, if that's somehow possible.

So to finally come around to it, what will those algorithms look like?

Maybe we can take a hint from the study of game theory. Robert Axelrod held two tournaments in the early 1980s in which computer programs competed against each other in an attempt to identify the optimal strategy for playing the iterated version of the famous Prisoner's Dilemma. In the one-off version of the game, the optimal strategy is to screw the other guy. (This is not the sort of thing we want to go teaching the AGI, at least not in isolation!) However, when multiple rounds of the game are played, something else begins to emerge:
By analysing the top-scoring strategies, Axelrod stated several conditions necessary for a strategy to be successful.

Nice

The most important condition is that the strategy must be "nice", that is, it will not defect before its opponent does. Almost all of the top-scoring strategies were nice. Therefore a purely selfish strategy for purely selfish reasons will never hit its opponent first.

Retaliating

However, Axelrod contended, the successful strategy must not be a blind optimist. It must always retaliate. An example of a non-retaliating strategy is Always Cooperate. This is a very bad choice, as "nasty" strategies will ruthlessly exploit such softies.

Forgiving

Another quality of successful strategies is that they must be forgiving. Though they will retaliate, they will once again fall back to cooperating if the opponent does not continue to defect. This stops long runs of revenge and counter-revenge, maximizing points.

Non-envious

The last quality is being non-envious, that is not striving to score more than the opponent (impossible for a 'nice' strategy, i.e., a 'nice' strategy can never score more than the opponent).

Therefore, Axelrod reached the Utopian-sounding conclusion that selfish individuals for their own selfish good will tend to be nice and forgiving and non-envious. One of the most important conclusions of Axelrod's study of IPDs is that Nice guys can finish first.
Bill Whittle has written recently that the qualities listed above underpin Western civilization, and help to explain why the West has out-competed other civilizations that operate using different strategies:
Now, this is where my own analysis kicks in, because frankly, nice, retaliating, forgiving and non-envious pretty much sums up how I feel about the West in general and the United States in particular. The web of trust and commerce in Western societies is unthinkable in the Third World because the prosperity they produce are fat juicy targets for people raised on Screw the Other Guy. Crime and corruption are stealing, and stealing is Screwing the Other Guy. It’s short-term win, long-term loss.
I would add that if we look at the three goals as goals for humanity rather than for artificial intelligence, we see better progress towards them in Western societies than elsewhere. In the tournament, the winning strategy, embodying all of the above characteristics, was called tit-for-tat. Interestingly, the computer program driving that strategy consisted of only four lines of BASIC code, which suggests a startling possibility -- like a simple recursive formula producing a complex Mandelbrot image, the moral complexity we're looking for might just be packed into a very simple set of mathematical relationships.
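To give a sense of how little code that really is, here's a sketch of tit-for-tat and a small round-robin iterated Prisoner's Dilemma tournament in Python. This is not Axelrod's original program, the payoffs are the standard textbook values, and the field of competing strategies is tiny and hand-picked, but the ranking mirrors his result: the nice-but-retaliating strategies finish on top and pure defection finishes last.

```python
# A minimal iterated Prisoner's Dilemma round-robin, using the standard
# payoffs (T=5, R=3, P=1, S=0). A sketch, not Axelrod's original code.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(mine, theirs):
    # Cooperate first, then copy whatever the opponent did last round.
    return "C" if not theirs else theirs[-1]

def always_defect(mine, theirs):
    return "D"

def always_cooperate(mine, theirs):
    return "C"

def grudger(mine, theirs):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in theirs else "C"

def play_match(strat_a, strat_b, rounds=200):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

strategies = [tit_for_tat, always_defect, always_cooperate, grudger]
totals = {s.__name__: 0 for s in strategies}
for a in strategies:
    for b in strategies:               # round robin, including self-play
        score, _ = play_match(a, b)
        totals[a.__name__] += score

print(totals)   # the "nice but retaliating" strategies finish at the top;
                # always_defect finishes last.
```

(The rankings do depend on the mix of strategies in the pool; part of what made Axelrod's result striking was that tit-for-tat kept winning across a large and varied field of entries.)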

So in order to develop and calibrate an Artificial General Intelligence that carries out our three top goals (or that helps us to achieve our coherent extrapolated volition), one of the important parameters to explore is how the AI relates to us and to other AIs. The secret might ultimately lie in playing nice with the AI, and teaching it to play nice with us and with other AIs. Not just because we want it to be nice, but because nice turns out to be -- at a mathematical level -- the best way to play.

Phil Bowermaster lives in Colorado where he works as the worldwide product marketing manager for Sybase IQ. He blogs at The Speculist and co-produces the Fast Forward Radio podcast.