Defining “Benevolence” in the context of Safe AI
Richard Loosemore
2014-12-10

There are many problems with this You-Just-Can’t-Define-It position. First, it ignores the huge common core that exists between different individuals’ concepts of benevolence, because that common core is just … well, not worth attending to! We take it for granted. And ethical/moral philosophers are most guilty of all in this respect: the common core that we all tend to accept is just too boring to talk about or think about, so it gets neglected and forgotten. I believe that if you sat down to the (boring) task of cataloguing the common core, you would be astonished at how much there is.

The second problem is that the common core might (and I believe does) tend to converge as you move toward people who are at the “extremely well-informed” + “empathic” + “rational” end of the scale. In other words, it might well be the case that as you look at people who know more and more about the world, who show strong signs of empathy toward others, and who are able to reason rationally (i.e. are not fanatically religious), you might well find that the convergence toward a common core idea of benevolence becomes even stronger.

Lastly, when people say “You just can’t DEFINE good-vs-evil, or benevolence, or ethical standards, or virtuous behavior…” what they are referring to is the inability to create a closed-form definition of these things.

That is, you cannot define them in such a way that the form of words fits into a dictionary entry of no more than a page, and covers the cases so well that 99.99% of the meaning is captured. There seems to be an assumption that if no closed-form definition exists, then the thing itself does not exist. This is Definition Chauvinism. It is especially prevalent among people who are dedicated to the idea that AI is Logic: that meanings can be captured in logical propositions; that the semantics of the atoms of a logical language can be captured in some kind of computable mapping of symbols to the world.

But even without a closed-form definition, it is possible for a concept to be captured in a large number of weak constraints. To people not familiar with neural nets, I think this sometimes comes as a bit of a shock. I can build a simple backprop network that captures the spelling-to-sound correspondences of all the words in the English language, and in that network the hidden layer can have a pattern of activation that “defines” a particular word so uniquely that it is distinguished massively and completely from all other words.

And yet, when you look at the individual neurons in that hidden layer, each one of them “means” something so vague as to be utterly undefinable. (Yes, NN experts among you: it depends on the number of units in the hidden layer and how it is trained; but for just the right choice of layer sizes, the patterns can be made distributed in such a way that interpretations are pretty damned difficult.) In this case, the pronunciation of a given word can be considered to be “defined” by the sum of a couple of hundred factors, EACH OF WHICH is vague to the point of banality. Certainty, in other words, can come from amazingly vague inputs that are allowed to work together in a certain way.
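To make that concrete, here is a minimal sketch of the kind of network I mean, in Python with NumPy. The word list, the one-hot letter encoding, and the made-up binary “phoneme feature” targets are all invented for illustration; the only point is that each hidden unit on its own stays vague, while the full hidden pattern picks out each word uniquely.

```python
import numpy as np

rng = np.random.default_rng(0)

words = ["cat", "cot", "cut", "bat", "bit", "but"]
letters = sorted(set("".join(words)))

def encode(word):
    # One-hot encode each of the three letter positions, concatenated.
    x = np.zeros(3 * len(letters))
    for pos, ch in enumerate(word):
        x[pos * len(letters) + letters.index(ch)] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([encode(w) for w in words])
Y = rng.integers(0, 2, size=(len(words), 8)).astype(float)  # toy "phoneme features"

W1 = rng.normal(0.0, 0.5, (X.shape[1], 10))   # input -> hidden
W2 = rng.normal(0.0, 0.5, (10, Y.shape[1]))   # hidden -> output

for _ in range(5000):                          # plain batch backprop
    H = sigmoid(X @ W1)
    out = sigmoid(H @ W2)
    d_out = (out - Y) * out * (1.0 - out)      # output-layer deltas
    d_hid = (d_out @ W2.T) * H * (1.0 - H)     # hidden-layer deltas
    W2 -= 0.5 * H.T @ d_out
    W1 -= 0.5 * X.T @ d_hid

H = sigmoid(X @ W1)
print(np.round(H, 2))   # each unit alone: broad, overlapping, "vague" activity
# ...yet the full hidden PATTERN cleanly separates every word from every other:
print(np.round(np.linalg.norm(H[:, None] - H[None, :], axis=2), 2))
```

The pairwise distances between hidden patterns should come out comfortably non-zero for every pair of words, even though no single column of H means anything crisp on its own.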

Now, that “certain way” in which the vague inputs are combined is called “simultaneous weak constraint relaxation”. It works like a charm. If you want another example, try this classic, courtesy of Geoff Hinton: “Tell me what X is, after I give you three facts about X, and I tell you ahead of time that the three facts are not only vague, but also that one of them (I won’t tell you which) is actually FALSE! So here they are: (1) X was an actor. (2) X was extremely intelligent. (3) X was a president.”

(Most people compute what X is within a second. Which is pretty amazing, given that X could have been anything in the whole universe, and that they were given an utterly lousy definition of it.)
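That style of inference is easy to mimic in code. Here is a minimal sketch: score each candidate against all the clues, weight rarer (more informative) properties more heavily, and take the best total fit. The candidate list and the feature table are invented for illustration (the table follows the puzzle’s joke about which clue is the false one).

```python
import math

# Invented feature table for four candidates (illustration only).
candidates = {
    "Ronald Reagan":   {"actor": True,  "intelligent": False, "president": True},
    "Meryl Streep":    {"actor": True,  "intelligent": True,  "president": False},
    "Abraham Lincoln": {"actor": False, "intelligent": True,  "president": True},
    "Albert Einstein": {"actor": False, "intelligent": True,  "president": False},
}

clues = {"actor": True, "intelligent": True, "president": True}

def weight(feature):
    # Rarer properties carry more information, so they pull harder.
    n_match = sum(c[feature] for c in candidates.values())
    return math.log(len(candidates) / max(n_match, 1))

def fit(features):
    # Each clue is a weak constraint: credit for every clue satisfied,
    # with no requirement that ALL of them be satisfied.
    return sum(weight(k) for k, v in clues.items() if features[k] == v)

best = max(candidates, key=lambda name: fit(candidates[name]))
print(best)   # one answer pops out despite one clue being false
```

No single clue identifies X, and one clue is outright wrong, yet the combined weak constraints single out one answer anyway; that is exactly the behavior the puzzle trades on.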

So what is the moral of that? Well, the closed-form definition of benevolence might not exist, in just the same way that there is virtually no way to produce a closed-form definition of how the pronunciation of a word relates to its spelling, if the “words” of the language you have to use are the hidden units of the network capturing the spelling-to-sound mapping. And yet, those “words,” when combined in a weak constraint relaxation system, allow the pronunciation to be uniquely specified. In just the same way, “benevolence” can be the result of a lot of subtle, hard-to-define factors, and it can be extraordinarily well-defined if that kind of “definition” is allowed.

Practical implication of this, so you can come down from all this abstract theory: if we built two different neural nets, each trained in a different way to pick up the various factors involved in benevolence, but each given a large enough data set, we might well find that EVEN THOUGH the two nets have built up two completely different sets of wiring inside, and EVEN THOUGH the training sets were not the same, they might converge so closely that if they were tested on a million different “ethical questions,” they might only disagree on a handful of fringe cases. A toy version of that experiment is sketched below.
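Here is the toy sketch, under heavy assumptions: the “ethical questions” are synthetic vectors of many weak factors, the ground truth is a made-up hidden rule, and the two “nets” are simple logistic-regression judges trained on disjoint samples. Nothing here is a real result; it only shows the convergence behavior such systems tend to exhibit.

```python
import numpy as np

rng = np.random.default_rng(1)
n_factors = 200   # many individually weak factors per "case"

# Made-up hidden ground truth combining all the factors.
true_w = rng.normal(0.0, 1.0, n_factors)

def make_cases(n):
    X = rng.normal(0.0, 1.0, (n, n_factors))
    return X, (X @ true_w > 0).astype(float)

def train_judge(X, y, steps=2000, lr=0.1):
    # Plain batch logistic regression; stands in for "a trained net".
    w = np.zeros(n_factors)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Two judges, trained on entirely different sets of cases.
wa = train_judge(*make_cases(3000))
wb = train_judge(*make_cases(3000))

# Compare their verdicts on a large batch of brand-new cases.
Xt, _ = make_cases(100_000)
agree = np.mean((Xt @ wa > 0) == (Xt @ wb > 0))
print(f"agreement on fresh cases: {agree:.2%}")
```

On runs like this the two judges typically agree on the overwhelming majority of fresh cases, disagreeing only near the decision boundary, which is the toy analogue of the “handful of fringe cases” above.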

Note: I am just saying “this might happen” at this stage, because the experiment has not been done … but that kind of result is absolutely typical of weak constraint relaxation systems, so I would not be surprised if it worked like a charm. And so now, if we assume that that did happen, what would it mean to say that “benevolence is impossible to define”?

I submit that that assertion would mean nothing. It would be true of “define” in the sense of a closed-form dictionary definition. But it would be wrong and irrelevant in the context of the way that weak constraint systems define things.

To be honest, I think this constant repetition of “benevolence is undefinable” is a distraction and a waste of our time.

You may not be able to define it.

But I am willing to bet that a systems builder could nail down a constraint system that would agree with virtually all of the human common core decisions about what was consistent with benevolence and what was not.