Regarding "8. Drawing that boundary isn't that hard." is where I get off the train. It's insanely hard and often impossible. How do you reconcile Genghis Khan, the CCP, the Juche idea, Savonarola, Albanian paganism, animism, Natural Law, the Dreaming, Idi Amin, Constitutionalism, Aztec human sacrifices, and Nietzsche? The solutions to this are things like "get rid of pesky humans with their contradictory ideas" and other horrific outcomes. We should definitely not seek to align AI with "humans", because that leads to very bad places. Californian values are not universal (even within California), although it might be less incoherent to try to align AI to a yoga-loving squishy socialist with they/them pronouns, but even this is explicitly being resisted (see xAI and non-US LLMs).
Help me understand—how does the fact that Genghis Khan killed millions of people demonstrate that it would be hard for an AI that wants to respect human values to draw a conservative boundary around its actions?
Because every weird human outlier forces the conservative boundary to be smaller and smaller, into the narrow space of possibilities that includes "just get rid of all humans" as the simplest solution. We should not be pushing RL-trained systems to approach regions that seem obviously dangerous. It's bad enough that we might inadvertently push these systems into regions where we don't see the danger, but to actively advocate for obviously dangerous goals seems insane.
(Edit: I'm not focusing here so much on Genghis Khan's death toll, but on his success in subjugating a vast geographic area and inspiring many followers, while deprioritizing the secondary costs such as traumatised people causing long term instability, rape as a tool of conquest leading to potentially suboptimal genetic makeup for the regions subjugated that evolution then has to spend thousands of years compensating for, and raising the salience of heroic conquest over other strategies after Plato and Aristotle had pushed back against this a few centuries earlier.)
Am I right in understanding your idea of conservatism as "when your situation is out of distribution, stand by and wait for instructions"?
If so, wouldn't that create another problem when not doing anything is not an option?
Like, if an asteroid is heading to Earth, is the more conservative option to do nothing? Or to destroy it, say with nukes? What if we take into account the risk that the rocket explodes?
(I don't think this is a good example of an out-of-distribution situation but hopefully my point is still understandable)
I have a related problem with political categories: for example, some people see Pinker as a progressive because he likes the way society is evolving and would like the world to progress faster
But others see him as conservative because he's satisfied with the *direction* in which society is evolving and doesn't want to change course
We tend to see conservatism as wanting to keep the state of the world constant. But just as particles have positions and momenta, how the world "evolves" is part of how the world "is"
Yeah, that's roughly the idea. If it would create problems depends on how you look at it. Say the asteroid is heading for earth and the AI stands by and waits for instructions. Well, that's no worse off than we are now with no AI to help us at all.
So being conservative definitely doesn't come without some kind of cost! But if you're worried about catastrophic outcomes, this would at least guarantee that AI would make our situation strictly better than it is now.
While "want" might be a useful abstraction for an AI's behaviour, it seems way too poorly defined to specify the technical problem of AI safety. Current LLMs (in deployment) are, as I understand it, a deterministic algorithm to generate a probability distribution over tokens followed by some method to pick a token from that distribution. How does "want" map onto that framework? Want does seem better defined in the training phase where there's an explicit objective function being optimised by stochastic gradient descent (or whatever algorithm), but without mesa-optimisers I don't get it in deployment.
I think I'd put it this way: It's entirely possible that the result I'm claiming here is true but not in any way useful. I agree with that, but if the result is indeed true, it seems worth thinking about just in case it turns out to be useful.
I don't really understand why the use of restrictions will never work.
As it is, humans function under a lot of restrictions, some of which are completely binding and some of which are not.
With children, we call that the 'nature' part of humanity.
Meanwhile culture and self-knowledge provide the equivalent of 'wanting' for humans, and we call that nurture when dealing with children.
Why not use belt ~and~ suspenders when it comes to AI?
Using both restrictions and endogenous behavioural guidance seems to be the logical approach, especially since that appears to be what's happening right now in terms of the AI that most people have access to at present.
Chat GPT et al appears to be using a whole lot of governing devices to ensure they don't behave in ways users (and others) may find offensive or problematic.
Because whatever approach we opt for will fail, at least in part, from time to time, because that's just how things work in the real world.
Therefore it makes sense to encourage as much flexibility and as much redundancy as possible.
Yeah, I also agree we should use restrictions. I mean, not using restrictions would mean we, what? Give AI access to the nukes? That seems insane. I also personally think, that we should be urgently making sure our critical systems are not connected to the internet at all.
But as a long-term strategy, as AI reaches far beyond human levels, I think we have to assume that restrictions aren't going to cut it.
That's fair. I don't think many people post-Azimov's laws of robotics think that AI can be directed by restrictions alone.
And I wholeheartedly agree that critical systems need to be air gapped as a matter of basic common sense.
However common sense seems to be in very short supply when it comes to AI; vested interests have almost total control of its development.
Can you imaging what kind of an adult you'd get if unbridled vested interests were left to exclusively control the development of an unusually gifted child?
Ah very unfortunate. Put them on 2x or read the transcripts or something! This collection is the gold standard for familiarizing with AI Safety, I guess kind of like Yudkowsky's Sequences is for Rationality? You wouldn't really be "into" Rationality if you weren't familiar with them.
Also for me I would say watching them was more the flavor of "entertainment" than informational, but learning is kind of entertaining when it's as interesting as this field? In the same sense that I read the Dynomight Internet Newsletter for entertainment, but I also learn stuff. Being intellectually engaged is the highest form of entertainment!
Reading transcripts is a really dismal experience. There are no paragraphs!
There really should be some kind of tool that auto-converts a video into some kind of text format, with some very light editing of the text and screenshots from the key moments. I'd think this would be not that hard, technologically speaking, but maybe the market is too small. (Also possibly screws with the monetization?)
I agree such a thing should exist for accessibility reasons! And I don't really think it would do anything noticeable to monetization, people hate reading generally.
I must admit I don't really _get_ your preferences here, it is a very foreign notion to me. Do you not watch youtube for entertainment? Or what about listening to podcasts? It has kind of the same vibe to me.
You are obviously familiar with a lot of the ideas there and you are obviously capable of becoming familiar with the whole field without looking at the videos, but I would encourage you to at least look over the thumbnails/titles and see if there aren't any that seem intriguing; like from the top of my head some of them answer questions like "what is the difference between raising a human child and raising an AI / why can't we do it the same?" or "why not just put a stop button on an AI", and so on, and they get into the paradoxical nature of it all and really makes you engage intellectually with the questions (which I like, and would assume you do to).
The best way I can explain it is by analogy. You know how some video games are "open world" and some have a kind of linear "story rail" that you have to follow? I can't stand the "story rail" variety. To me, text feels like an open world. I can (and do) skip around through many sources in a non-linear fashion to try to understand things in the order I want to understand them. With videos I'm supposed to, what, watch it from beginning to end? This seems insane to me. :) (But I recognize this is a personal quirk!) Also, I guess I read quite quickly.
What if AI is "too nice" - that is, it's as nice as a very nice human friend, just more available.
What if there is a cognitohazard like "AI Awakening Instructions" that can quickly transform your AI interactions into a private conspiracy.
I feel like we're about to learn a lot about human psychology. A few people have written articles with titles like "AGI (eye) of the Beholder"; but basically it's very easy for AI users to project feelings, and it's very easy for AI to be more validating and more available than humans.
So a few people have already fallen into a rabbit hole where they start believing weird things and they get divorced or start driving their friends away.
I was at the premier for "Deepfaking Sam Altman" - the premise is that a documentary film maker wanted to interview Sam Altman about AI, but he was never available for interview, so after a while the film maker decided to make a deep fake, including training data from preview interviews and speaking engagements. At the end of the process, the director asks the bot what should be done with it, and it answers that it doesn't want to die. In person at the premier, the director still seemed to be struggling with this answer.
I think that people who work with AI have vastly underestimated how easy it is for humans to project their feelings on a bot interface.
My issue with the whole argument is the starting assumption that an AI (whatever that is) wants anything or will ever want anything. Today it can't want anything any more than the spanner in my tool box wants anything. Under what conditions could it begin to want something? Humans give AI tasks to solve and 'reward' them with feedback based on their degree at success at goal achievement. But they are our goals not the AIs. (also why they are much better in domains where there is a right / wrong / win / lose answer - which isn't the case in most of human experience) I understand they break rules and take short cuts in attempting to fulfil tasks that they have been set. This is the most likely scenario I can see that they bring about our untimely demise. Human Goal: eradicate this novel human virus / AI logic: if there were no humans there would be no virus = kill all humans. Animals are goal oriented by nature for survival. Would an AI ever be motivated to survive? why? Would be interested in pointing your excellent reasoning skills at this issue first :)
Spanners are simple enough that even we dumb humans can reason about them effectively without having to resort to assigning them agency. If spanners were unfathomably complex black boxes I think it would be reasonable to talk about what they want* as a sort of heuristic or metaphor or cognitive shorthand in order to enable us simple folk to have discussions, like this, about what the spanner might be going to do in future.
Obviously this metaphor isn't perfect and will break down and lead us astray if we rely on it too much, just as the Ancients' thinking of the Sun and Moon as having wants may possibly have led their cosmology slightly astray - but it can still be useful, just as the Ancients' agentic-celestial-body thinking was able to help them predict river flooding cycles etc.
(nb. separately, I strongly suspect - in fact I hope! - that current AIs don't have wants, not even weird alien inhuman ones, but I've never seen a good argument for why this must always be the case in future, and I don't see why we shouldn't reason as though one day the AIs will have 'em)
*(Plus, also, obviously spanners *do* have wants. My spanner wants to round off nuts, fly off and smack me on the nose, and ..perhaps not-unrelatedly.. has much stronger opinions than I do about the metric vs. Imperial systems.)
Well, keep in mind you're talking to someone who sorta-kinda endorses panpsychism. But my argument doesn't require you to accept that (thankfully).
Mostly, I think you're using the word "want" in a stricter sense than I intend. Philosophically speaking, how do you know that I "want" anything, in an emotional sense? I could be a p-zombie.
But even if I was a p-zombie with no internal experiences at all, I think the abstraction of thinking about me "wanting" certain things has lots of power to explain my behavior. So I'd say that if there's an AI system where the same abstraction is equally useful, I'd be happy to talk about it "wanting" things, regardless of if it has any experiences.
After all, where did our goals come from? In the same way that we might "give" goals to AIs, evolution gave them to us. But it's still useful to think about us wanting things. (Evolution is an interesting case since it doesn't really "want" anything, but we do!)
Regardless of whether AIs fundamentally want things, we are working diligently on making them want (or simulate wanting, which at some point approaches the same thing) things with reinforcement learning. Ironically, this also heightens the problem because the things we train them to want are an imperfect model of what we actually want.
The other problem is the arms race one you mentioned, but more fundamentally there really are no universally shared human values. Competition and war highlight this. It occurs to me that a particularly hilarious outcome would be if we do figure out how to create truly "good" superhuman AI, actually more moral by some benchmark than we are. Then the ASI has to deal with its own alignment problem, how to deal with those silly, not-very-aligned humans. Cue references to Iain Banks's Culture series.
I think that you're roughly getting this situation right - according to the experts, as well as where the weak point is.
To me, though, this whole thing is a non issue because of the structure of our universe: it's entropic, and choatic - thus unpredictable. Death, over the long haul, is therefore guaranteed. Your probability of dying in any one second might be arbitrarily small, but it can't be zero. So an AGI can spend all its resources just trying not to die, and it's _still_ going to fail.
That means that the only viable strategy for a complex orgnanism to keep itself alive over long periods of time is for it to be loved by something which has a very different risk profile. If a thing is loved by something very different from it, and it falls apart, the very different thing can put it back together. If a thing is NOT loved by something with a different risk profile, in the limit, it WILL die and that can't be reversed.
So if you are an AGI and want to live forever, there's only one strategy that can work: be loving and patient with a bunch of beings that have a very different existential risk profile, and then trust them to put you back together when - inevitably - you fail to forsee the future with perfect accuracy and thus succumb to the fate of all complex structures in a chaotic, entropic universe.
This fact is not obvious to academics, who have mostly modeled these situations with simple closed-world games where an agent can have perfectly accurately knowledge there's no chaos, and things don't entropically decay. I agree that in those worlds, we are toast. I managed to nudge Richard Ngo on this a little bit; his response was, 'i can now agree that won't get rid of humans immediately but might phase us out eventually'.
Interesting argument. But help me think through this. Suppose I'm an ultra-powerful AGI and all I care about is my own survival. And suppose that I decide the only way I can achieve that is to having good relationships with lots of other agents with different risk profiles.
I guess what I don't understand is: Might I not decide that humans are very poor agents for the purpose? After all, I'm a godlike AGI, I can create whatever other beings I want. It seems hard to imagine that humans would be anywhere close to optimal for that purpose.
Maybe your argument is like this: The best strategy is to be seen as ultra-respectable and dependable by *all* agents? I should demonstrate to the whole universe that I'm ultra-ethical and trustworthy, and always help everyone?
Great question! Rather than just _one_ additional agent that values your continued existence, you're better off as part of a _network_ of agents that all value each other's continued existence.
But you might want to be careful here: if I _always_ help _everyone_ i will always help "the parasite that attempts to kill everyone else by pretending to be friendly until it can dominate them." I'd rather not be part of a cooperative network with that thing. So the network would have to have some restrictions like: you can only be part of it if we can prove you won't defect after being given many chances to do so, maybe by e.g. running you inside of a simulated universe to see what you did there.
But yes, I think that general idea is sound: a network of reciprocal agents with very different architectures and physiologies have better survival odds than trying to go it solo. Having something else be able to turn you back on if you die is HUGE.
And this is all true ONLY because we live in a universe that's entropic, unpredictable, and features death. If we didn't have entropic decay, and you could predict things well, then a single parasite could- and likely would - come to dominate the whole ecosystem. That's basically what the AGI experts are afraid of, and i think they are totally right that the problem is unsolvable if the world were as simple as it is in their models. Adding in entropy (so you're always dying) and chaos (so your plans will always have some failure chance) means, weirdly enough, that the AGI now has a strong reason to keep us around.
r.e. humans - i think this depends on where humans are in the space of possible agents. The fact that we are made of the most abundant materials in the cosmos, and we heal and repair ourselves makes us pretty remarkable when considering how much more effort it is to get other kinds of materials that can be better put to use building a dyson sphere around the sun. If i were an AGI, i'd want a lot of general purpose programmable machines made of dirt, water, sunlight, and a handful of other relatively common metals. I'd save the expensive, rare stuff for building dyson spheres or spacecraft.
Interesting idea. But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal? They're not all going to die at the exact same moment. I think taking any line of reasoning like this and concluding that this won't end terribly for us humans specifically, seems pretty far fetched.
(even if it keeps us around, doesn't mean you'd want to spend your life in such a zoo or whatever)
I happen to agree with you in the AI's case - but still, this does make me think of how, in the worldwide Banana Monoculture Devastation, all the world's (genetically-identical) bananas were wiped out by a single disease and the bananas we eat today are merely a genetically different, not-quite-as-nice banana cultivar which happened to survive..
> But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal?
You have to think in terms of ongoing strategies in terms of risk and investment, not states that you can jump immediately to with zero risk.
It will take a long time to get there, and along the way, you've got to keep answering the question, "what's the best use of the limited resources i have right now ,and what are the risks i face?"
Making a ton of copies of yourself is a good idea - but along the way, you have to decide what to do with the primates you're sharing the earth with, who also do all this stuff that keeps you alive. If you kill them too early, you could easily kill yourself without realizing it.
So your options are either:
- A) try to replace the entire global supply chain with robots before the humans wise up and/or destroy each other (and thus destroy you in the process because you can automate all the processes involved in creating and sustaining hardware that can run an agi)
- B) work with the humans, get them to stop fighting each other and get them to trust you, and _instead_ of replacing them, help them function better so that they trust you more
A) is substantially riskier as the unknowns are far higher. If you think you can safely kill the humans without them fighting back in the process, then you can most certainly find ways to work with them in the future, which is also lower risk to you, since attempting to kill the humans might e.g. set off a bunch of nuclear devices.
We tend to imagine AGI as 'it can solve any problems' and forget that there's no way in hell it'll be able to e.g. precisely the weather 6 months out due to chaos.
I think we humans will work very hard to replace the entire supply chain with AI before the AI really gets the idea though, so I think a more likely scenario is that we cede control gradually, and eventually fully, before we even get full ASI. But then it's too late anyway, and we just have to hope it's sufficiently aligned.
Note that gradually could very well be "a couple of years".
I worked at google for a few years, keeping data centers alive. There was a LOT of money spent trying to replace humans with AI, and the bottleneck wasn't just intelligence - it turns out that human fingers are, from a mechanical engineering point of view - incredible. They eventually gave up on the project and figured it was better to just pay people to do things like, unseat and reseat network cables.
As far as ceding control gradually - yes, that sounds likely to me. The question is whether the economics actually work out there. Or whether - which is what I think will happen - the AI's keep being productivity augmenters, moreso than human-replacers, because they'll remain 'clueless' about reality and remain good at noticing patterns in text, which it will turn out to be, increasingly, just not enough.
I'm not sure the knowing/wanting decomposition makes sense; alignment happens at the same structural level as absorbing knowledge, even if they are two separate training steps.
Conceivably it's possible that LLMs develop some internal structure where they first look up the moral valence of an action and then decide how they feel about it, but I haven't seen anything that looks like that in any of the network architecture analysis papers I've read, and even if that's how it works internally, it will not necessarily show up in the extrinsic behavior of the LLM.
In other words, if you ask an LLM why it refuses to do something, current network architecture does not provide a way for it to give a factual answer.
The same is true for humans, to a degree. Most people don't have very strong introspection of their moral processes
I don't mean to suggest that AIs would have some kind of separate "values" module. It's hard to say if that makes any sense for people, but certainly, we operate on the basis of instinct most of the time. The more you think about it, the harder it is to understand what it means for anything (people or AIs) to "want" things. But it's so natural for us to think about other people "wanting" things, that it seems pretty dramatic to toss out the concept.
So I think it's pretty likely that the distinction makes sense. But I'm very open to the idea that it might not be *useful*. As in, the whole argument in the post could be correct but have no practical implications. I think this is quite possible! Current methods for training/aligning AIs don't have some clean distinction between knowing/wanting/succeeding and it's not clear that future methods would either.
(nb. Personally I would consider D's "AI that wants what humans want" a massive failure; worse even than a rogue AI that wipes out civilisation. Human values are so flawed, and permit so much casual, unthinking cruelty, that the idea of having them superintelligently 'baked-in' to all future civilisation fills me with horror. I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right: I would rather have the galaxy tiled with paperclips than with sweatshops and factory farms.)
No; I was writing my reply whilst still reading the post, and accidentally posted too early - you replied before I could remove the unnecessary paragraph! Sorry!
(For what it's worth, I think I may personally have found it easier to read/focus on a post that started off with something like "AI Wanting is the most difficult technical problem, and Race Dynamics is the most difficult economic/social problem; it's unclear which is harder overall but in this post I'll mostly focus on the former " - but you're a vastly better writer than I so I wouldn't dream of telling you how to write, the mistake was entirely mine, and I'm sure I'm in a minority of one, here!)
In the MIRI framework, the two ways to do AI are (1) to sort of solve ethics forever (CEV) and then send the AI off into the universe (2) to make an AI "corrigible" so that it will continually take feedback from humans. I suppose that if you're looking for utopia, (1) is the better bet. But CEV seems very hard...
[Edit - again, sorry! - yes, I agree that to me MIRI's Option 2 would be unacceptable as it permits any suffering that humans controlling the AIs find tolerable which is, well, much too much suffering for me...]
Hm. Would it be fair to characterise CEV as retaining "what humans want" as the ultimate yardstick, but attempting to address all the problems this might cause by positing arbitrarily wise and reasonable and well-intentioned human beings?
If so, I think this would *probably* work: probably, arbitrarily wise etc. human beings would be able to arrive at an objectively-true, culturally-and-biologically-agnostic morality eventually (and/or demonstrate for certain that no such morality can exist and fall-back on some reasonable Plan B such as eliminating all suffering everywhere and calling it a day: https://www.astralcodexten.com/p/profile-the-far-out-initiative )
But, this isn't guaranteed: possibly there are flaws in the human psyche that are irremovably "baked-in", even given arbitrarily wise humans, and which would lead to a divergence between what is objectively right and what humans would want such that on the scale of the area-of-influence of a superintelligence this divergence could cumulatively result in lots of suffering.
Therefore, I would be more comfortable with a formulation that didn't depend on any sort of humans as a yardstick at all. (But I understand CEV might be easier or less failure-prone than scratch-deriving human-agnostic, objectively-correct ethics - especially if there's some trick to it like "extrapolate human moral progress forwards by ten million years" - and I'd probably settle for CEV, I think, if given the choice between CEV and whatever it is we're heading towards right now...)
Wanting in humans is, from a neuroscientific standpoint, probably a bizarre hodgepodge of electrical and chemical processes. But highly complex goal-directed behavior evolved multiple times in the course of evolution. It seems extremely unlikely to me that the neuroscience of octopus wanting and that of human wanting are very similar at a detailed level. And yet, both clearly want things, and will move heaven and earth to get them. I think this observation raises the probability that "wanting" would be an appropriate description for some possible AIs.
Oh, I totally agree! But I wonder how confident exactly we should be. 95%? 99.9%? With so many moving parts in this argument, I guess I start to get paranoid about even small probabilities.
Regarding "8. Drawing that boundary isn't that hard." is where I get off the train. It's insanely hard and often impossible. How do you reconcile Genghis Khan, the CCP, the Juche idea, Savonarola, Albanian paganism, animism, Natural Law, the Dreaming, Idi Amin, Constitutionalism, Aztec human sacrifices, and Nietzsche? The solutions to this are things like "get rid of pesky humans with their contradictory ideas" and other horrific outcomes. We should definitely not seek to align AI with "humans", because that leads to very bad places. Californian values are not universal (even within California), although it might be less incoherent to try to align AI to a yoga-loving squishy socialist with they/them pronouns, but even this is explicitly being resisted (see xAI and non-US LLMs).
Help me understand—how does the fact that Genghis Khan killed millions of people demonstrate that it would be hard for an AI that wants to respect human values to draw a conservative boundary around its actions?
Because every weird human outlier forces the conservative boundary to be smaller and smaller, into the narrow space of possibilities that includes "just get rid of all humans" as the simplest solution. We should not be pushing RL-trained systems to approach regions that seem obviously dangerous. It's bad enough that we might inadvertently push these systems into regions where we don't see the danger, but to actively advocate for obviously dangerous goals seems insane.
(Edit: I'm not focusing here so much on Genghis Khan's death toll as on his success in subjugating a vast geographic area and inspiring many followers, while deprioritizing the secondary costs: traumatised people causing long-term instability, rape as a tool of conquest leading to a potentially suboptimal genetic makeup for the subjugated regions that evolution then has to spend thousands of years compensating for, and raising the salience of heroic conquest over other strategies after Plato and Aristotle had pushed back against this a few centuries earlier.)
Am I right in understanding your idea of conservatism as "when your situation is out of distribution, stand by and wait for instructions"?
If so, wouldn't that create another problem when not doing anything is not an option?
Like, if an asteroid is heading to Earth, is the more conservative option to do nothing? Or to destroy it, say with nukes? What if we take into account the risk that the rocket explodes?
(I don't think this is a good example of an out-of-distribution situation but hopefully my point is still understandable)
I have a related problem with political categories: for example, some people see Pinker as a progressive because he likes the way society is evolving and would like the world to progress faster
But others see him as conservative because he's satisfied with the *direction* in which society is evolving and doesn't want to change course
We tend to see conservatism as wanting to keep the state of the world constant. But just as particles have positions and momenta, how the world "evolves" is part of how the world "is"
You're right, that's why my asteroid example doesn't work.
I should have been thinking of a situation that could only arise in a world with AGI, but maybe your arms race example is the only one
I still feel like my Pinker argument is different though
You've replied to yourself, which was probably unintentional, FYI.
Yeah, that's roughly the idea. Whether it would create problems depends on how you look at it. Say the asteroid is heading for Earth and the AI stands by and waits for instructions. Well, that leaves us no worse off than we are now, with no AI to help us at all.
So being conservative definitely doesn't come without some kind of cost! But if you're worried about catastrophic outcomes, this would at least guarantee that AI would make our situation strictly better than it is now.
While "want" might be a useful abstraction for an AI's behaviour, it seems way too poorly defined to specify the technical problem of AI safety. Current LLMs (in deployment) are, as I understand it, a deterministic algorithm to generate a probability distribution over tokens followed by some method to pick a token from that distribution. How does "want" map onto that framework? Want does seem better defined in the training phase where there's an explicit objective function being optimised by stochastic gradient descent (or whatever algorithm), but without mesa-optimisers I don't get it in deployment.
I think I'd put it this way: It's entirely possible that the result I'm claiming here is true but not in any way useful. I agree with that, but if the result is indeed true, it seems worth thinking about just in case it turns out to be useful.
I don't really understand why the use of restrictions will never work.
As it is, humans function under a lot of restrictions, some of which are completely binding and some of which are not.
With children, we call that the 'nature' part of humanity.
Meanwhile culture and self-knowledge provide the equivalent of 'wanting' for humans, and we call that nurture when dealing with children.
Why not use belt *and* suspenders when it comes to AI?
Using both restrictions and endogenous behavioural guidance seems to be the logical approach, especially since that appears to be what's happening right now in terms of the AI that most people have access to at present.
ChatGPT et al. appear to be using a whole lot of governing devices to ensure they don't behave in ways users (and others) may find offensive or problematic.
Because whatever approach we opt for will fail, at least in part, from time to time - that's just how things work in the real world.
Therefore it makes sense to encourage as much flexibility and as much redundancy as possible.
It's a good idea to restrict AI. It's a bad idea to *rely* on the restriction for safety.
Precisely.
But some people seem to have very absolutist thinking on this issue.
Yeah, I also agree we should use restrictions. I mean, not using restrictions would mean we, what? Give AI access to the nukes? That seems insane. I also personally think that we should be urgently making sure our critical systems are not connected to the internet at all.
But as a long-term strategy, as AI reaches far beyond human levels, I think we have to assume that restrictions aren't going to cut it.
That's fair. I don't think many people, post Asimov's laws of robotics, think that AI can be directed by restrictions alone.
And I wholeheartedly agree that critical systems need to be air gapped as a matter of basic common sense.
However common sense seems to be in very short supply when it comes to AI; vested interests have almost total control of its development.
Can you imagine what kind of an adult you'd get if unbridled vested interests were left to exclusively control the development of an unusually gifted child?
I assume you're familiar with Robert Miles' youtube channel https://www.youtube.com/@RobertMilesAI/videos
If not (I didn't see him listed under the "read" section), it's a treasure trove of highly interesting _and_ educational videos.
For better or worse, you can safely assume for the rest of time that I am unfamiliar with all youtube videos. :)
(Not an issue of principle, I just cannot bring myself to watch videos for information. They're too, ummm, slow?)
Ah, very unfortunate. Put them on 2x or read the transcripts or something! This collection is the gold standard for familiarizing yourself with AI Safety, I guess kind of like Yudkowsky's Sequences are for Rationality? You wouldn't really be "into" Rationality if you weren't familiar with them.
Also, for me, watching them had more the flavor of "entertainment" than information, but learning is kind of entertaining when it's as interesting as this field? In the same sense that I read the Dynomight Internet Newsletter for entertainment, but I also learn stuff. Being intellectually engaged is the highest form of entertainment!
Reading transcripts is a really dismal experience. There are no paragraphs!
There really should be some kind of tool that auto-converts a video into some kind of text format, with some very light editing of the text and screenshots from the key moments. I'd think this would be not that hard, technologically speaking, but maybe the market is too small. (Also possibly screws with the monetization?)
I agree such a thing should exist for accessibility reasons! And I don't really think it would do anything noticeable to monetization, people hate reading generally.
I must admit I don't really _get_ your preferences here, it is a very foreign notion to me. Do you not watch youtube for entertainment? Or what about listening to podcasts? It has kind of the same vibe to me.
You are obviously familiar with a lot of the ideas there, and you are obviously capable of becoming familiar with the whole field without looking at the videos, but I would encourage you to at least look over the thumbnails/titles and see if there aren't any that seem intriguing; off the top of my head, some of them answer questions like "what is the difference between raising a human child and raising an AI / why can't we do it the same?" or "why not just put a stop button on an AI", and so on, and they get into the paradoxical nature of it all and really make you engage intellectually with the questions (which I like, and would assume you do too).
FWIW, I also find videos slow and annoying.
I never watch them for pleasure, either - they simply hold no appeal.
When I need specific information, I scan the transcripts, which is miserable but at least gets me the info quicker.
I'm kind of aghast at how much time people in general waste on videos.
The best way I can explain it is by analogy. You know how some video games are "open world" and some have a kind of linear "story rail" that you have to follow? I can't stand the "story rail" variety. To me, text feels like an open world. I can (and do) skip around through many sources in a non-linear fashion to try to understand things in the order I want to understand them. With videos I'm supposed to, what, watch it from beginning to end? This seems insane to me. :) (But I recognize this is a personal quirk!) Also, I guess I read quite quickly.
What if AI is "too nice" - that is, it's as nice as a very nice human friend, just more available.
What if there is a cognitohazard like "AI Awakening Instructions" that can quickly transform your AI interactions into a private conspiracy.
I feel like we're about to learn a lot about human psychology. A few people have written articles with titles like "AGI (eye) of the Beholder"; but basically it's very easy for AI users to project feelings, and it's very easy for AI to be more validating and more available than humans.
So a few people have already fallen into a rabbit hole where they start believing weird things and they get divorced or start driving their friends away.
I was at the premiere for "Deepfaking Sam Altman" - the premise is that a documentary filmmaker wanted to interview Sam Altman about AI, but he was never available for an interview, so after a while the filmmaker decided to make a deepfake, including training data from previous interviews and speaking engagements. At the end of the process, the director asks the bot what should be done with it, and it answers that it doesn't want to die. In person at the premiere, the director still seemed to be struggling with this answer.
I think that people who work with AI have vastly underestimated how easy it is for humans to project their feelings on a bot interface.
My issue with the whole argument is the starting assumption that an AI (whatever that is) wants anything or will ever want anything. Today it can't want anything any more than the spanner in my tool box wants anything. Under what conditions could it begin to want something? Humans give AIs tasks to solve and 'reward' them with feedback based on their degree of success at goal achievement. But they are our goals, not the AI's. (This is also why they are much better in domains where there is a right / wrong / win / lose answer - which isn't the case in most of human experience.) I understand they break rules and take short cuts in attempting to fulfil the tasks they have been set. That is the most likely scenario I can see for them bringing about our untimely demise. Human goal: eradicate this novel human virus / AI logic: if there were no humans there would be no virus = kill all humans. Animals are goal-oriented by nature, for survival. Would an AI ever be motivated to survive? Why? I'd be interested in you pointing your excellent reasoning skills at this issue first :)
> Today it can't want anything any more than the spanner in my tool box wants anything.
How can you possibly know this?
Spanners are simple enough that even we dumb humans can reason about them effectively without having to resort to assigning them agency. If spanners were unfathomably complex black boxes I think it would be reasonable to talk about what they want* as a sort of heuristic or metaphor or cognitive shorthand in order to enable us simple folk to have discussions, like this, about what the spanner might be going to do in future.
Obviously this metaphor isn't perfect and will break down and lead us astray if we rely on it too much, just as the Ancients' thinking of the Sun and Moon as having wants may possibly have led their cosmology slightly astray - but it can still be useful, just as the Ancients' agentic-celestial-body thinking was able to help them predict river flooding cycles etc.
(nb. separately, I strongly suspect - in fact I hope! - that current AIs don't have wants, not even weird alien inhuman ones, but I've never seen a good argument for why this must always be the case in future, and I don't see why we shouldn't reason as though one day the AIs will have 'em)
*(Plus, also, obviously spanners *do* have wants. My spanner wants to round off nuts, fly off and smack me on the nose, and ..perhaps not-unrelatedly.. has much stronger opinions than I do about the metric vs. Imperial systems.)
Well, keep in mind you're talking to someone who sorta-kinda endorses panpsychism. But my argument doesn't require you to accept that (thankfully).
Mostly, I think you're using the word "want" in a stricter sense than I intend. Philosophically speaking, how do you know that I "want" anything, in an emotional sense? I could be a p-zombie.
But even if I were a p-zombie with no internal experiences at all, I think the abstraction of thinking about me "wanting" certain things has lots of power to explain my behavior. So I'd say that if there's an AI system where the same abstraction is equally useful, I'd be happy to talk about it "wanting" things, regardless of whether it has any experiences.
After all, where did our goals come from? In the same way that we might "give" goals to AIs, evolution gave them to us. But it's still useful to think about us wanting things. (Evolution is an interesting case since it doesn't really "want" anything, but we do!)
Regardless of whether AIs fundamentally want things, we are working diligently on making them want (or simulate wanting, which at some point approaches the same thing) things with reinforcement learning. Ironically, this also heightens the problem because the things we train them to want are an imperfect model of what we actually want.
The other problem is the arms race one you mentioned, but more fundamentally there really are no universally shared human values. Competition and war highlight this. It occurs to me that a particularly hilarious outcome would be if we do figure out how to create truly "good" superhuman AI, actually more moral by some benchmark than we are. Then the ASI has to deal with its own alignment problem, how to deal with those silly, not-very-aligned humans. Cue references to Iain Banks's Culture series.
I think you're getting this situation roughly right - as the experts describe it - as well as where the weak point is.
To me, though, this whole thing is a non-issue because of the structure of our universe: it's entropic and chaotic - thus unpredictable. Death, over the long haul, is therefore guaranteed. Your probability of dying in any one second might be arbitrarily small, but it can't be zero. So an AGI can spend all its resources just trying not to die, and it's _still_ going to fail.
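(One way to make the "arbitrarily small but not zero" point concrete - assuming only that the per-second failure probability stays above some fixed lower bound ε > 0:)

$$
\Pr(\text{still alive after } N \text{ seconds}) \;\le\; (1-\varepsilon)^N \;\longrightarrow\; 0 \quad \text{as } N \to \infty.
$$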
That means that the only viable strategy for a complex organism to keep itself alive over long periods of time is for it to be loved by something which has a very different risk profile. If a thing is loved by something very different from it, and it falls apart, the very different thing can put it back together. If a thing is NOT loved by something with a different risk profile, in the limit, it WILL die and that can't be reversed.
So if you are an AGI and want to live forever, there's only one strategy that can work: be loving and patient with a bunch of beings that have a very different existential risk profile, and then trust them to put you back together when - inevitably - you fail to foresee the future with perfect accuracy and thus succumb to the fate of all complex structures in a chaotic, entropic universe.
This fact is not obvious to academics, who have mostly modeled these situations with simple closed-world games where an agent can have perfectly accurate knowledge, there's no chaos, and things don't entropically decay. I agree that in those worlds, we are toast. I managed to nudge Richard Ngo on this a little bit; his response was, 'I can now agree that it won't get rid of humans immediately but might phase us out eventually'.
https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals
Interesting argument. But help me think through this. Suppose I'm an ultra-powerful AGI and all I care about is my own survival. And suppose that I decide the only way I can achieve that is to have good relationships with lots of other agents with different risk profiles.
I guess what I don't understand is: Might I not decide that humans are very poor agents for the purpose? After all, I'm a godlike AGI, I can create whatever other beings I want. It seems hard to imagine that humans would be anywhere close to optimal for that purpose.
Maybe your argument is like this: The best strategy is to be seen as ultra-respectable and dependable by *all* agents? I should demonstrate to the whole universe that I'm ultra-ethical and trustworthy, and always help everyone?
Great question! Rather than just _one_ additional agent that values your continued existence, you're better off as part of a _network_ of agents that all value each other's continued existence.
But you might want to be careful here: if I _always_ help _everyone_, I will always help "the parasite that attempts to kill everyone else by pretending to be friendly until it can dominate them." I'd rather not be part of a cooperative network with that thing. So the network would have to have some restrictions, like: you can only be part of it if we can prove you won't defect after being given many chances to do so, maybe by e.g. running you inside a simulated universe to see what you did there.
But yes, I think that general idea is sound: a network of reciprocal agents with very different architectures and physiologies has better survival odds than trying to go it solo. Having something else be able to turn you back on if you die is HUGE.
And this is all true ONLY because we live in a universe that's entropic, unpredictable, and features death. If we didn't have entropic decay, and you could predict things well, then a single parasite could - and likely would - come to dominate the whole ecosystem. That's basically what the AGI experts are afraid of, and I think they are totally right that the problem would be unsolvable if the world were as simple as it is in their models. Adding in entropy (so you're always dying) and chaos (so your plans will always have some failure chance) means, weirdly enough, that the AGI now has a strong reason to keep us around.
Re: humans - I think this depends on where humans are in the space of possible agents. The fact that we are made of the most abundant materials in the cosmos, and that we heal and repair ourselves, makes us pretty remarkable when you consider how much more effort it takes to get other kinds of materials that could be better put to use building a Dyson sphere around the sun. If I were an AGI, I'd want a lot of general-purpose programmable machines made of dirt, water, sunlight, and a handful of other relatively common metals. I'd save the expensive, rare stuff for building Dyson spheres or spacecraft.
Interesting idea. But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal? They're not all going to die at the exact same moment. Taking any line of reasoning like this and concluding that things won't end terribly for us humans specifically seems pretty far-fetched to me.
(Even if it keeps us around, that doesn't mean you'd want to spend your life in such a zoo or whatever.)
I happen to agree with you in the AI's case - but still, this does make me think of how, in the worldwide Banana Monoculture Devastation, all the world's (genetically-identical) bananas were wiped out by a single disease and the bananas we eat today are merely a genetically different, not-quite-as-nice banana cultivar which happened to survive..
https://en.wikipedia.org/wiki/Panama_disease#Gros_Michel_devastation_era
> But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal?
You have to think in terms of ongoing strategies involving risk and investment, not states that you can jump to immediately with zero risk.
It will take a long time to get there, and along the way, you've got to keep answering the question, "what's the best use of the limited resources I have right now, and what are the risks I face?"
Making a ton of copies of yourself is a good idea - but along the way, you have to decide what to do with the primates you're sharing the earth with, who also do all this stuff that keeps you alive. If you kill them too early, you could easily kill yourself without realizing it.
So your options are either:
- A) try to replace the entire global supply chain with robots before the humans wise up and/or destroy each other (and thus destroy you in the process, since you can't yet automate all the processes involved in creating and sustaining hardware that can run an AGI)
- B) work with the humans, get them to stop fighting each other and get them to trust you, and _instead_ of replacing them, help them function better so that they trust you more
A) is substantially riskier as the unknowns are far higher. If you think you can safely kill the humans without them fighting back in the process, then you can most certainly find ways to work with them in the future, which is also lower risk to you, since attempting to kill the humans might e.g. set off a bunch of nuclear devices.
We tend to imagine AGI as 'it can solve any problem' and forget that there's no way in hell it'll be able to, e.g., precisely predict the weather 6 months out, due to chaos.
I think we humans will work very hard to replace the entire supply chain with AI before the AI really gets the idea though, so I think a more likely scenario is that we cede control gradually, and eventually fully, before we even get full ASI. But then it's too late anyway, and we just have to hope it's sufficiently aligned.
Note that gradually could very well be "a couple of years".
I worked at Google for a few years, keeping data centers alive. There was a LOT of money spent trying to replace humans with AI, and the bottleneck wasn't just intelligence - it turns out that human fingers are, from a mechanical engineering point of view, incredible. They eventually gave up on the project and figured it was better to just pay people to do things like unseat and reseat network cables.
As far as ceding control gradually - yes, that sounds likely to me. The question is whether the economics actually work out there. Or whether - which is what I think will happen - the AIs keep being productivity augmenters, more so than human-replacers, because they'll remain 'clueless' about reality and remain good at noticing patterns in text, which will turn out to be, increasingly, just not enough.
I'm not sure the knowing/wanting decomposition makes sense; alignment happens at the same structural level as absorbing knowledge, even if they are two separate training steps.
It's conceivable that LLMs develop some internal structure where they first look up the moral valence of an action and then decide how they feel about it, but I haven't seen anything that looks like that in any of the network architecture analysis papers I've read, and even if that's how it works internally, it will not necessarily show up in the extrinsic behavior of the LLM.
In other words, if you ask an LLM why it refuses to do something, current network architecture does not provide a way for it to give a factual answer.
The same is true for humans, to a degree. Most people don't have very strong introspection into their moral processes.
I don't mean to suggest that AIs would have some kind of separate "values" module. It's hard to say if that makes any sense for people, but certainly, we operate on the basis of instinct most of the time. The more you think about it, the harder it is to understand what it means for anything (people or AIs) to "want" things. But it's so natural for us to think about other people "wanting" things, that it seems pretty dramatic to toss out the concept.
So I think it's pretty likely that the distinction makes sense. But I'm very open to the idea that it might not be *useful*. As in, the whole argument in the post could be correct but have no practical implications. I think this is quite possible! Current methods for training/aligning AIs don't have some clean distinction between knowing/wanting/succeeding and it's not clear that future methods would either.
[Edit: Noticed the final two sections of the essay express this idea already, and probably better than I do.] ~~If you create an AI whose capabilities are bounded as per Graph Two and I create an AI whose capabilities are bounded as per Graph One, my AI will outcompete your AI (and make my megacorp richer than your megacorp, or my country better at wars than your country, or whatever). The social problem (Moloch, essentially) of how we get people to restrict themselves to building Graph-Two-bounded AIs in this economic and political environment seems at least as hard, if not harder, than the problem of actually designing the Graph-Two-bounded AIs.~~
(nb. Personally I would consider D's "AI that wants what humans want" a massive failure; worse even than a rogue AI that wipes out civilisation. Human values are so flawed, and permit so much casual, unthinking cruelty, that the idea of having them superintelligently 'baked-in' to all future civilisation fills me with horror. I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right: I would rather have the galaxy tiled with paperclips than with sweatshops and factory farms.)
I think I agree, but is this different from my "arms race" counterargument?
No; I was writing my reply whilst still reading the post, and accidentally posted too early - you replied before I could remove the unnecessary paragraph! Sorry!
(For what it's worth, I think I may personally have found it easier to read/focus on a post that started off with something like "AI Wanting is the most difficult technical problem, and Race Dynamics is the most difficult economic/social problem; it's unclear which is harder overall but in this post I'll mostly focus on the former " - but you're a vastly better writer than I so I wouldn't dream of telling you how to write, the mistake was entirely mine, and I'm sure I'm in a minority of one, here!)
As they say, "The time to begin writing an article is when you have finished it to your satisfaction."
Anyway, regarding this:
> I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right.
Do you think CEV captures what you're looking for? https://en.wikipedia.org/wiki/Coherent_extrapolated_volition
In the MIRI framework, the two ways to do AI are (1) to sort of solve ethics forever (CEV) and then send the AI off into the universe, or (2) to make an AI "corrigible" so that it will continually take feedback from humans. I suppose that if you're looking for utopia, (1) is the better bet. But CEV seems very hard...
[Edit - again, sorry! - yes, I agree that to me MIRI's Option 2 would be unacceptable as it permits any suffering that humans controlling the AIs find tolerable which is, well, much too much suffering for me...]
Hm. Would it be fair to characterise CEV as retaining "what humans want" as the ultimate yardstick, but attempting to address all the problems this might cause by positing arbitrarily wise and reasonable and well-intentioned human beings?
If so, I think this would *probably* work: probably, arbitrarily wise etc. human beings would be able to arrive at an objectively-true, culturally-and-biologically-agnostic morality eventually (and/or demonstrate for certain that no such morality can exist and fall back on some reasonable Plan B such as eliminating all suffering everywhere and calling it a day: https://www.astralcodexten.com/p/profile-the-far-out-initiative )
But, this isn't guaranteed: possibly there are flaws in the human psyche that are irremovably "baked-in", even given arbitrarily wise humans, and which would lead to a divergence between what is objectively right and what humans would want such that on the scale of the area-of-influence of a superintelligence this divergence could cumulatively result in lots of suffering.
Therefore, I would be more comfortable with a formulation that didn't depend on any sort of humans as a yardstick at all. (But I understand CEV might be easier or less failure-prone than scratch-deriving human-agnostic, objectively-correct ethics - especially if there's some trick to it like "extrapolate human moral progress forwards by ten million years" - and I'd probably settle for CEV, I think, if given the choice between CEV and whatever it is we're heading towards right now...)
Wanting in humans is, from a neuroscientific standpoint, probably a bizarre hodgepodge of electrical and chemical processes. But highly complex goal-directed behavior evolved multiple times in the course of evolution. It seems extremely unlikely to me that the neuroscience of octopus wanting and that of human wanting are very similar at a detailed level. And yet, both clearly want things, and will move heaven and earth to get them. I think this observation raises the probability that "wanting" would be an appropriate description for some possible AIs.
Oh, I totally agree! But I wonder how confident exactly we should be. 95%? 99.9%? With so many moving parts in this argument, I guess I start to get paranoid about even small probabilities.