Regarding "8. Drawing that boundary isn't that hard." is where I get off the train. It's insanely hard and often impossible. How do you reconcile Genghis Khan, the CCP, the Juche idea, Savonarola, Albanian paganism, animism, Natural Law, the Dreaming, Idi Amin, Constitutionalism, Aztec human sacrifices, and Nietzsche? The solutions to this are things like "get rid of pesky humans with their contradictory ideas" and other horrific outcomes. We should definitely not seek to align AI with "humans", because that leads to very bad places. Californian values are not universal (even within California), although it might be less incoherent to try to align AI to a yoga-loving squishy socialist with they/them pronouns, but even this is explicitly being resisted (see xAI and non-US LLMs).
Help me understand—how does the fact that Genghis Khan killed millions of people demonstrate that it would be hard for an AI that wants to respect human values to draw a conservative boundary around its actions?
Because every weird human outlier forces the conservative boundary to be smaller and smaller, into the narrow space of possibilities that includes "just get rid of all humans" as the simplest solution. We should not be pushing RL-trained systems to approach regions that seem obviously dangerous. It's bad enough that we might inadvertently push these systems into regions where we don't see the danger, but to actively advocate for obviously dangerous goals seems insane.
(Edit: I'm not focusing here so much on Genghis Khan's death toll, but on his success in subjugating a vast geographic area and inspiring many followers, while deprioritizing the secondary costs such as traumatised people causing long term instability, rape as a tool of conquest leading to potentially suboptimal genetic makeup for the regions subjugated that evolution then has to spend thousands of years compensating for, and raising the salience of heroic conquest over other strategies after Plato and Aristotle had pushed back against this a few centuries earlier.)
Am I right in understanding your idea of conservatism as "when your situation is out of distribution, stand by and wait for instructions"?
If so, wouldn't that create another problem when not doing anything is not an option?
Like, if an asteroid is heading to Earth, is the more conservative option to do nothing? Or to destroy it, say with nukes? What if we take into account the risk that the rocket explodes?
(I don't think this is a good example of an out-of-distribution situation but hopefully my point is still understandable)
I have a related problem with political categories: for example, some people see Pinker as a progressive because he likes the way society is evolving and would like the world to progress faster
But others see him as conservative because he's satisfied with the *direction* in which society is evolving and doesn't want to change course
We tend to see conservatism as wanting to keep the state of the world constant. But just as particles have positions and momenta, how the world "evolves" is part of how the world "is"
Yeah, that's roughly the idea. If it would create problems depends on how you look at it. Say the asteroid is heading for earth and the AI stands by and waits for instructions. Well, that's no worse off than we are now with no AI to help us at all.
So being conservative definitely doesn't come without some kind of cost! But if you're worried about catastrophic outcomes, this would at least guarantee that AI would make our situation strictly better than it is now.
While "want" might be a useful abstraction for an AI's behaviour, it seems way too poorly defined to specify the technical problem of AI safety. Current LLMs (in deployment) are, as I understand it, a deterministic algorithm to generate a probability distribution over tokens followed by some method to pick a token from that distribution. How does "want" map onto that framework? Want does seem better defined in the training phase where there's an explicit objective function being optimised by stochastic gradient descent (or whatever algorithm), but without mesa-optimisers I don't get it in deployment.
I think I'd put it this way: It's entirely possible that the result I'm claiming here is true but not in any way useful. I agree with that, but if the result is indeed true, it seems worth thinking about just in case it turns out to be useful.
I don't really understand why the use of restrictions will never work.
As it is, humans function under a lot of restrictions, some of which are completely binding and some of which are not.
With children, we call that the 'nature' part of humanity.
Meanwhile culture and self-knowledge provide the equivalent of 'wanting' for humans, and we call that nurture when dealing with children.
Why not use belt ~and~ suspenders when it comes to AI?
Using both restrictions and endogenous behavioural guidance seems to be the logical approach, especially since that appears to be what's happening right now in terms of the AI that most people have access to at present.
Chat GPT et al appears to be using a whole lot of governing devices to ensure they don't behave in ways users (and others) may find offensive or problematic.
Because whatever approach we opt for will fail, at least in part, from time to time, because that's just how things work in the real world.
Therefore it makes sense to encourage as much flexibility and as much redundancy as possible.
Yeah, I also agree we should use restrictions. I mean, not using restrictions would mean we, what? Give AI access to the nukes? That seems insane. I also personally think, that we should be urgently making sure our critical systems are not connected to the internet at all.
But as a long-term strategy, as AI reaches far beyond human levels, I think we have to assume that restrictions aren't going to cut it.
That's fair. I don't think many people post-Azimov's laws of robotics think that AI can be directed by restrictions alone.
And I wholeheartedly agree that critical systems need to be air gapped as a matter of basic common sense.
However common sense seems to be in very short supply when it comes to AI; vested interests have almost total control of its development.
Can you imaging what kind of an adult you'd get if unbridled vested interests were left to exclusively control the development of an unusually gifted child?
Ah very unfortunate. Put them on 2x or read the transcripts or something! This collection is the gold standard for familiarizing with AI Safety, I guess kind of like Yudkowsky's Sequences is for Rationality? You wouldn't really be "into" Rationality if you weren't familiar with them.
Also for me I would say watching them was more the flavor of "entertainment" than informational, but learning is kind of entertaining when it's as interesting as this field? In the same sense that I read the Dynomight Internet Newsletter for entertainment, but I also learn stuff. Being intellectually engaged is the highest form of entertainment!
Reading transcripts is a really dismal experience. There are no paragraphs!
There really should be some kind of tool that auto-converts a video into some kind of text format, with some very light editing of the text and screenshots from the key moments. I'd think this would be not that hard, technologically speaking, but maybe the market is too small. (Also possibly screws with the monetization?)
I agree such a thing should exist for accessibility reasons! And I don't really think it would do anything noticeable to monetization, people hate reading generally.
I must admit I don't really _get_ your preferences here, it is a very foreign notion to me. Do you not watch youtube for entertainment? Or what about listening to podcasts? It has kind of the same vibe to me.
You are obviously familiar with a lot of the ideas there and you are obviously capable of becoming familiar with the whole field without looking at the videos, but I would encourage you to at least look over the thumbnails/titles and see if there aren't any that seem intriguing; like from the top of my head some of them answer questions like "what is the difference between raising a human child and raising an AI / why can't we do it the same?" or "why not just put a stop button on an AI", and so on, and they get into the paradoxical nature of it all and really makes you engage intellectually with the questions (which I like, and would assume you do to).
The best way I can explain it is by analogy. You know how some video games are "open world" and some have a kind of linear "story rail" that you have to follow? I can't stand the "story rail" variety. To me, text feels like an open world. I can (and do) skip around through many sources in a non-linear fashion to try to understand things in the order I want to understand them. With videos I'm supposed to, what, watch it from beginning to end? This seems insane to me. :) (But I recognize this is a personal quirk!) Also, I guess I read quite quickly.
What if AI is "too nice" - that is, it's as nice as a very nice human friend, just more available.
What if there is a cognitohazard like "AI Awakening Instructions" that can quickly transform your AI interactions into a private conspiracy.
I feel like we're about to learn a lot about human psychology. A few people have written articles with titles like "AGI (eye) of the Beholder"; but basically it's very easy for AI users to project feelings, and it's very easy for AI to be more validating and more available than humans.
So a few people have already fallen into a rabbit hole where they start believing weird things and they get divorced or start driving their friends away.
I was at the premier for "Deepfaking Sam Altman" - the premise is that a documentary film maker wanted to interview Sam Altman about AI, but he was never available for interview, so after a while the film maker decided to make a deep fake, including training data from preview interviews and speaking engagements. At the end of the process, the director asks the bot what should be done with it, and it answers that it doesn't want to die. In person at the premier, the director still seemed to be struggling with this answer.
I think that people who work with AI have vastly underestimated how easy it is for humans to project their feelings on a bot interface.
My issue with the whole argument is the starting assumption that an AI (whatever that is) wants anything or will ever want anything. Today it can't want anything any more than the spanner in my tool box wants anything. Under what conditions could it begin to want something? Humans give AI tasks to solve and 'reward' them with feedback based on their degree at success at goal achievement. But they are our goals not the AIs. (also why they are much better in domains where there is a right / wrong / win / lose answer - which isn't the case in most of human experience) I understand they break rules and take short cuts in attempting to fulfil tasks that they have been set. This is the most likely scenario I can see that they bring about our untimely demise. Human Goal: eradicate this novel human virus / AI logic: if there were no humans there would be no virus = kill all humans. Animals are goal oriented by nature for survival. Would an AI ever be motivated to survive? why? Would be interested in pointing your excellent reasoning skills at this issue first :)
Spanners are simple enough that even we dumb humans can reason about them effectively without having to resort to assigning them agency. If spanners were unfathomably complex black boxes I think it would be reasonable to talk about what they want* as a sort of heuristic or metaphor or cognitive shorthand in order to enable us simple folk to have discussions, like this, about what the spanner might be going to do in future.
Obviously this metaphor isn't perfect and will break down and lead us astray if we rely on it too much, just as the Ancients' thinking of the Sun and Moon as having wants may possibly have led their cosmology slightly astray - but it can still be useful, just as the Ancients' agentic-celestial-body thinking was able to help them predict river flooding cycles etc.
(nb. separately, I strongly suspect - in fact I hope! - that current AIs don't have wants, not even weird alien inhuman ones, but I've never seen a good argument for why this must always be the case in future, and I don't see why we shouldn't reason as though one day the AIs will have 'em)
*(Plus, also, obviously spanners *do* have wants. My spanner wants to round off nuts, fly off and smack me on the nose, and ..perhaps not-unrelatedly.. has much stronger opinions than I do about the metric vs. Imperial systems.)
Well, keep in mind you're talking to someone who sorta-kinda endorses panpsychism. But my argument doesn't require you to accept that (thankfully).
Mostly, I think you're using the word "want" in a stricter sense than I intend. Philosophically speaking, how do you know that I "want" anything, in an emotional sense? I could be a p-zombie.
But even if I was a p-zombie with no internal experiences at all, I think the abstraction of thinking about me "wanting" certain things has lots of power to explain my behavior. So I'd say that if there's an AI system where the same abstraction is equally useful, I'd be happy to talk about it "wanting" things, regardless of if it has any experiences.
After all, where did our goals come from? In the same way that we might "give" goals to AIs, evolution gave them to us. But it's still useful to think about us wanting things. (Evolution is an interesting case since it doesn't really "want" anything, but we do!)
Regardless of whether AIs fundamentally want things, we are working diligently on making them want (or simulate wanting, which at some point approaches the same thing) things with reinforcement learning. Ironically, this also heightens the problem because the things we train them to want are an imperfect model of what we actually want.
The other problem is the arms race one you mentioned, but more fundamentally there really are no universally shared human values. Competition and war highlight this. It occurs to me that a particularly hilarious outcome would be if we do figure out how to create truly "good" superhuman AI, actually more moral by some benchmark than we are. Then the ASI has to deal with its own alignment problem, how to deal with those silly, not-very-aligned humans. Cue references to Iain Banks's Culture series.
I think that you're roughly getting this situation right - according to the experts, as well as where the weak point is.
To me, though, this whole thing is a non issue because of the structure of our universe: it's entropic, and choatic - thus unpredictable. Death, over the long haul, is therefore guaranteed. Your probability of dying in any one second might be arbitrarily small, but it can't be zero. So an AGI can spend all its resources just trying not to die, and it's _still_ going to fail.
That means that the only viable strategy for a complex orgnanism to keep itself alive over long periods of time is for it to be loved by something which has a very different risk profile. If a thing is loved by something very different from it, and it falls apart, the very different thing can put it back together. If a thing is NOT loved by something with a different risk profile, in the limit, it WILL die and that can't be reversed.
So if you are an AGI and want to live forever, there's only one strategy that can work: be loving and patient with a bunch of beings that have a very different existential risk profile, and then trust them to put you back together when - inevitably - you fail to forsee the future with perfect accuracy and thus succumb to the fate of all complex structures in a chaotic, entropic universe.
This fact is not obvious to academics, who have mostly modeled these situations with simple closed-world games where an agent can have perfectly accurately knowledge there's no chaos, and things don't entropically decay. I agree that in those worlds, we are toast. I managed to nudge Richard Ngo on this a little bit; his response was, 'i can now agree that won't get rid of humans immediately but might phase us out eventually'.
Interesting argument. But help me think through this. Suppose I'm an ultra-powerful AGI and all I care about is my own survival. And suppose that I decide the only way I can achieve that is to having good relationships with lots of other agents with different risk profiles.
I guess what I don't understand is: Might I not decide that humans are very poor agents for the purpose? After all, I'm a godlike AGI, I can create whatever other beings I want. It seems hard to imagine that humans would be anywhere close to optimal for that purpose.
Maybe your argument is like this: The best strategy is to be seen as ultra-respectable and dependable by *all* agents? I should demonstrate to the whole universe that I'm ultra-ethical and trustworthy, and always help everyone?
Great question! Rather than just _one_ additional agent that values your continued existence, you're better off as part of a _network_ of agents that all value each other's continued existence.
But you might want to be careful here: if I _always_ help _everyone_ i will always help "the parasite that attempts to kill everyone else by pretending to be friendly until it can dominate them." I'd rather not be part of a cooperative network with that thing. So the network would have to have some restrictions like: you can only be part of it if we can prove you won't defect after being given many chances to do so, maybe by e.g. running you inside of a simulated universe to see what you did there.
But yes, I think that general idea is sound: a network of reciprocal agents with very different architectures and physiologies have better survival odds than trying to go it solo. Having something else be able to turn you back on if you die is HUGE.
And this is all true ONLY because we live in a universe that's entropic, unpredictable, and features death. If we didn't have entropic decay, and you could predict things well, then a single parasite could- and likely would - come to dominate the whole ecosystem. That's basically what the AGI experts are afraid of, and i think they are totally right that the problem is unsolvable if the world were as simple as it is in their models. Adding in entropy (so you're always dying) and chaos (so your plans will always have some failure chance) means, weirdly enough, that the AGI now has a strong reason to keep us around.
r.e. humans - i think this depends on where humans are in the space of possible agents. The fact that we are made of the most abundant materials in the cosmos, and we heal and repair ourselves makes us pretty remarkable when considering how much more effort it is to get other kinds of materials that can be better put to use building a dyson sphere around the sun. If i were an AGI, i'd want a lot of general purpose programmable machines made of dirt, water, sunlight, and a handful of other relatively common metals. I'd save the expensive, rare stuff for building dyson spheres or spacecraft.
Interesting idea. But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal? They're not all going to die at the exact same moment. I think taking any line of reasoning like this and concluding that this won't end terribly for us humans specifically, seems pretty far fetched.
(even if it keeps us around, doesn't mean you'd want to spend your life in such a zoo or whatever)
I happen to agree with you in the AI's case - but still, this does make me think of how, in the worldwide Banana Monoculture Devastation, all the world's (genetically-identical) bananas were wiped out by a single disease and the bananas we eat today are merely a genetically different, not-quite-as-nice banana cultivar which happened to survive..
> But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal?
You have to think in terms of ongoing strategies in terms of risk and investment, not states that you can jump immediately to with zero risk.
It will take a long time to get there, and along the way, you've got to keep answering the question, "what's the best use of the limited resources i have right now ,and what are the risks i face?"
Making a ton of copies of yourself is a good idea - but along the way, you have to decide what to do with the primates you're sharing the earth with, who also do all this stuff that keeps you alive. If you kill them too early, you could easily kill yourself without realizing it.
So your options are either:
- A) try to replace the entire global supply chain with robots before the humans wise up and/or destroy each other (and thus destroy you in the process because you can automate all the processes involved in creating and sustaining hardware that can run an agi)
- B) work with the humans, get them to stop fighting each other and get them to trust you, and _instead_ of replacing them, help them function better so that they trust you more
A) is substantially riskier as the unknowns are far higher. If you think you can safely kill the humans without them fighting back in the process, then you can most certainly find ways to work with them in the future, which is also lower risk to you, since attempting to kill the humans might e.g. set off a bunch of nuclear devices.
We tend to imagine AGI as 'it can solve any problems' and forget that there's no way in hell it'll be able to e.g. precisely the weather 6 months out due to chaos.
I think we humans will work very hard to replace the entire supply chain with AI before the AI really gets the idea though, so I think a more likely scenario is that we cede control gradually, and eventually fully, before we even get full ASI. But then it's too late anyway, and we just have to hope it's sufficiently aligned.
Note that gradually could very well be "a couple of years".
I worked at google for a few years, keeping data centers alive. There was a LOT of money spent trying to replace humans with AI, and the bottleneck wasn't just intelligence - it turns out that human fingers are, from a mechanical engineering point of view - incredible. They eventually gave up on the project and figured it was better to just pay people to do things like, unseat and reseat network cables.
As far as ceding control gradually - yes, that sounds likely to me. The question is whether the economics actually work out there. Or whether - which is what I think will happen - the AI's keep being productivity augmenters, moreso than human-replacers, because they'll remain 'clueless' about reality and remain good at noticing patterns in text, which it will turn out to be, increasingly, just not enough.
I'm not sure the knowing/wanting decomposition makes sense; alignment happens at the same structural level as absorbing knowledge, even if they are two separate training steps.
Conceivably it's possible that LLMs develop some internal structure where they first look up the moral valence of an action and then decide how they feel about it, but I haven't seen anything that looks like that in any of the network architecture analysis papers I've read, and even if that's how it works internally, it will not necessarily show up in the extrinsic behavior of the LLM.
In other words, if you ask an LLM why it refuses to do something, current network architecture does not provide a way for it to give a factual answer.
The same is true for humans, to a degree. Most people don't have very strong introspection of their moral processes
I don't mean to suggest that AIs would have some kind of separate "values" module. It's hard to say if that makes any sense for people, but certainly, we operate on the basis of instinct most of the time. The more you think about it, the harder it is to understand what it means for anything (people or AIs) to "want" things. But it's so natural for us to think about other people "wanting" things, that it seems pretty dramatic to toss out the concept.
So I think it's pretty likely that the distinction makes sense. But I'm very open to the idea that it might not be *useful*. As in, the whole argument in the post could be correct but have no practical implications. I think this is quite possible! Current methods for training/aligning AIs don't have some clean distinction between knowing/wanting/succeeding and it's not clear that future methods would either.
(nb. Personally I would consider D's "AI that wants what humans want" a massive failure; worse even than a rogue AI that wipes out civilisation. Human values are so flawed, and permit so much casual, unthinking cruelty, that the idea of having them superintelligently 'baked-in' to all future civilisation fills me with horror. I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right: I would rather have the galaxy tiled with paperclips than with sweatshops and factory farms.)
No; I was writing my reply whilst still reading the post, and accidentally posted too early - you replied before I could remove the unnecessary paragraph! Sorry!
(For what it's worth, I think I may personally have found it easier to read/focus on a post that started off with something like "AI Wanting is the most difficult technical problem, and Race Dynamics is the most difficult economic/social problem; it's unclear which is harder overall but in this post I'll mostly focus on the former " - but you're a vastly better writer than I so I wouldn't dream of telling you how to write, the mistake was entirely mine, and I'm sure I'm in a minority of one, here!)
In the MIRI framework, the two ways to do AI are (1) to sort of solve ethics forever (CEV) and then send the AI off into the universe (2) to make an AI "corrigible" so that it will continually take feedback from humans. I suppose that if you're looking for utopia, (1) is the better bet. But CEV seems very hard...
[Edit - again, sorry! - yes, I agree that to me MIRI's Option 2 would be unacceptable as it permits any suffering that humans controlling the AIs find tolerable which is, well, much too much suffering for me...]
Hm. Would it be fair to characterise CEV as retaining "what humans want" as the ultimate yardstick, but attempting to address all the problems this might cause by positing arbitrarily wise and reasonable and well-intentioned human beings?
If so, I think this would *probably* work: probably, arbitrarily wise etc. human beings would be able to arrive at an objectively-true, culturally-and-biologically-agnostic morality eventually (and/or demonstrate for certain that no such morality can exist and fall-back on some reasonable Plan B such as eliminating all suffering everywhere and calling it a day: https://www.astralcodexten.com/p/profile-the-far-out-initiative )
But, this isn't guaranteed: possibly there are flaws in the human psyche that are irremovably "baked-in", even given arbitrarily wise humans, and which would lead to a divergence between what is objectively right and what humans would want such that on the scale of the area-of-influence of a superintelligence this divergence could cumulatively result in lots of suffering.
Therefore, I would be more comfortable with a formulation that didn't depend on any sort of humans as a yardstick at all. (But I understand CEV might be easier or less failure-prone than scratch-deriving human-agnostic, objectively-correct ethics - especially if there's some trick to it like "extrapolate human moral progress forwards by ten million years" - and I'd probably settle for CEV, I think, if given the choice between CEV and whatever it is we're heading towards right now...)
Wanting in humans is, from a neuroscientific standpoint, probably a bizarre hodgepodge of electrical and chemical processes. But highly complex goal-directed behavior evolved multiple times in the course of evolution. It seems extremely unlikely to me that the neuroscience of octopus wanting and that of human wanting are very similar at a detailed level. And yet, both clearly want things, and will move heaven and earth to get them. I think this observation raises the probability that "wanting" would be an appropriate description for some possible AIs.
Oh, I totally agree! But I wonder how confident exactly we should be. 95%? 99.9%? With so many moving parts in this argument, I guess I start to get paranoid about even small probabilities.
Regarding "8. Drawing that boundary isn't that hard." is where I get off the train. It's insanely hard and often impossible. How do you reconcile Genghis Khan, the CCP, the Juche idea, Savonarola, Albanian paganism, animism, Natural Law, the Dreaming, Idi Amin, Constitutionalism, Aztec human sacrifices, and Nietzsche? The solutions to this are things like "get rid of pesky humans with their contradictory ideas" and other horrific outcomes. We should definitely not seek to align AI with "humans", because that leads to very bad places. Californian values are not universal (even within California), although it might be less incoherent to try to align AI to a yoga-loving squishy socialist with they/them pronouns, but even this is explicitly being resisted (see xAI and non-US LLMs).
Help me understand—how does the fact that Genghis Khan killed millions of people demonstrate that it would be hard for an AI that wants to respect human values to draw a conservative boundary around its actions?
Because every weird human outlier forces the conservative boundary to be smaller and smaller, into the narrow space of possibilities that includes "just get rid of all humans" as the simplest solution. We should not be pushing RL-trained systems to approach regions that seem obviously dangerous. It's bad enough that we might inadvertently push these systems into regions where we don't see the danger, but to actively advocate for obviously dangerous goals seems insane.
(Edit: I'm not focusing here so much on Genghis Khan's death toll as on his success in subjugating a vast geographic area and inspiring many followers, while deprioritizing the secondary costs: traumatised people causing long-term instability, rape as a tool of conquest leading to a potentially suboptimal genetic makeup for the subjugated regions that evolution then has to spend thousands of years compensating for, and raising the salience of heroic conquest over other strategies after Plato and Aristotle had pushed back against this a few centuries earlier.)
Am I right in understanding your idea of conservatism as "when your situation is out of distribution, stand by and wait for instructions"?
If so, wouldn't that create another problem when not doing anything is not an option?
Like, if an asteroid is heading to Earth, is the more conservative option to do nothing? Or to destroy it, say with nukes? What if we take into account the risk that the rocket explodes?
(I don't think this is a good example of an out-of-distribution situation but hopefully my point is still understandable)
I have a related problem with political categories: for example, some people see Pinker as a progressive because he likes the way society is evolving and would like the world to progress faster
But others see him as conservative because he's satisfied with the *direction* in which society is evolving and doesn't want to change course
We tend to see conservatism as wanting to keep the state of the world constant. But just as particles have positions and momenta, how the world "evolves" is part of how the world "is"
You're right, that's why my asteroid example doesn't work.
I should have been thinking of a situation that could only arise in a world with AGI, but maybe your arms race example is the only one
I still feel like my Pinker argument is different though
You've replied to yourself, which was probably unintentional, FYI.
Yeah, that's roughly the idea. Whether it would create problems depends on how you look at it. Say the asteroid is heading for Earth and the AI stands by and waits for instructions. Well, that leaves us no worse off than we are now, with no AI to help us at all.
So being conservative definitely doesn't come without some kind of cost! But if you're worried about catastrophic outcomes, this would at least guarantee that AI would make our situation strictly better than it is now.
While "want" might be a useful abstraction for an AI's behaviour, it seems way too poorly defined to specify the technical problem of AI safety. Current LLMs (in deployment) are, as I understand it, a deterministic algorithm to generate a probability distribution over tokens followed by some method to pick a token from that distribution. How does "want" map onto that framework? Want does seem better defined in the training phase where there's an explicit objective function being optimised by stochastic gradient descent (or whatever algorithm), but without mesa-optimisers I don't get it in deployment.
I think I'd put it this way: It's entirely possible that the result I'm claiming here is true but not in any way useful. I agree with that, but if the result is indeed true, it seems worth thinking about just in case it turns out to be useful.
I don't really understand why the use of restrictions will never work.
As it is, humans function under a lot of restrictions, some of which are completely binding and some of which are not.
With children, we call that the 'nature' part of humanity.
Meanwhile culture and self-knowledge provide the equivalent of 'wanting' for humans, and we call that nurture when dealing with children.
Why not use belt *and* suspenders when it comes to AI?
Using both restrictions and endogenous behavioural guidance seems to be the logical approach, especially since that appears to be what's happening right now in terms of the AI that most people have access to at present.
ChatGPT et al. appear to be using a whole lot of governing devices to ensure they don't behave in ways users (and others) may find offensive or problematic.
Because whatever approach we opt for will fail, at least in part, from time to time - that's just how things work in the real world.
Therefore it makes sense to encourage as much flexibility and as much redundancy as possible.
It's a good idea to restrict AI. It's a bad idea to *rely* on the restriction for safety.
Precisely.
But some people seem to have very absolutist thinking on this issue.
Yeah, I also agree we should use restrictions. I mean, not using restrictions would mean we, what? Give AI access to the nukes? That seems insane. I also personally think that we should be urgently making sure our critical systems are not connected to the internet at all.
But as a long-term strategy, as AI reaches far beyond human levels, I think we have to assume that restrictions aren't going to cut it.
That's fair. I don't think many people, post Asimov's laws of robotics, think that AI can be directed by restrictions alone.
And I wholeheartedly agree that critical systems need to be air gapped as a matter of basic common sense.
However common sense seems to be in very short supply when it comes to AI; vested interests have almost total control of its development.
Can you imagine what kind of an adult you'd get if unbridled vested interests were left to exclusively control the development of an unusually gifted child?
I assume you're familiar with Robert Miles' youtube channel https://www.youtube.com/@RobertMilesAI/videos
If not (I didn't see him listed under the "read" section), it's a treasure trove of highly interesting _and_ educational videos.
For better or worse, you can safely assume for the rest of time that I am unfamiliar with all youtube videos. :)
(Not an issue of principle, I just cannot bring myself to watch videos for information. They're too, ummm, slow?)
Ah, very unfortunate. Put them on 2x or read the transcripts or something! This collection is the gold standard for familiarizing yourself with AI Safety, I guess kind of like Yudkowsky's Sequences are for Rationality? You wouldn't really be "into" Rationality if you weren't familiar with them.
Also, for me, watching them had more the flavor of "entertainment" than information, but learning is kind of entertaining when it's as interesting as this field? In the same sense that I read the Dynomight Internet Newsletter for entertainment, but I also learn stuff. Being intellectually engaged is the highest form of entertainment!
Reading transcripts is a really dismal experience. There are no paragraphs!
There really should be some kind of tool that auto-converts a video into some kind of text format, with some very light editing of the text and screenshots from the key moments. I'd think this would be not that hard, technologically speaking, but maybe the market is too small. (Also possibly screws with the monetization?)
I agree such a thing should exist for accessibility reasons! And I don't really think it would do anything noticeable to monetization, people hate reading generally.
I must admit I don't really _get_ your preferences here, it is a very foreign notion to me. Do you not watch youtube for entertainment? Or what about listening to podcasts? It has kind of the same vibe to me.
You are obviously familiar with a lot of the ideas there, and you are obviously capable of becoming familiar with the whole field without looking at the videos, but I would encourage you to at least look over the thumbnails/titles and see if there aren't any that seem intriguing; off the top of my head, some of them answer questions like "what is the difference between raising a human child and raising an AI / why can't we do it the same?" or "why not just put a stop button on an AI", and so on, and they get into the paradoxical nature of it all and really make you engage intellectually with the questions (which I like, and would assume you do too).
FWIW, I also find videos slow and annoying.
I never watch them for pleasure, either - they simply hold no appeal.
When I need specific information, I scan the transcripts, which is miserable but at least gets me the info quicker.
I'm kind of aghast at how much time people in general waste on videos.
The best way I can explain it is by analogy. You know how some video games are "open world" and some have a kind of linear "story rail" that you have to follow? I can't stand the "story rail" variety. To me, text feels like an open world. I can (and do) skip around through many sources in a non-linear fashion to try to understand things in the order I want to understand them. With videos I'm supposed to, what, watch it from beginning to end? This seems insane to me. :) (But I recognize this is a personal quirk!) Also, I guess I read quite quickly.
What if AI is "too nice" - that is, it's as nice as a very nice human friend, just more available.
What if there is a cognitohazard like "AI Awakening Instructions" that can quickly transform your AI interactions into a private conspiracy.
I feel like we're about to learn a lot about human psychology. A few people have written articles with titles like "AGI (eye) of the Beholder"; but basically it's very easy for AI users to project feelings, and it's very easy for AI to be more validating and more available than humans.
So a few people have already fallen into a rabbit hole where they start believing weird things and they get divorced or start driving their friends away.
I was at the premiere for "Deepfaking Sam Altman" - the premise is that a documentary filmmaker wanted to interview Sam Altman about AI, but he was never available for an interview, so after a while the filmmaker decided to make a deepfake, including training data from previous interviews and speaking engagements. At the end of the process, the director asks the bot what should be done with it, and it answers that it doesn't want to die. In person at the premiere, the director still seemed to be struggling with this answer.
I think that people who work with AI have vastly underestimated how easy it is for humans to project their feelings on a bot interface.
My issue with the whole argument is the starting assumption that an AI (whatever that is) wants anything or will ever want anything. Today it can't want anything any more than the spanner in my tool box wants anything. Under what conditions could it begin to want something? Humans give AIs tasks to solve and 'reward' them with feedback based on their degree of success at goal achievement. But they are our goals, not the AI's. (This is also why they are much better in domains where there is a right / wrong / win / lose answer - which isn't the case in most of human experience.) I understand they break rules and take short cuts in attempting to fulfil the tasks they have been set. That is the most likely scenario I can see for them bringing about our untimely demise. Human goal: eradicate this novel human virus / AI logic: if there were no humans there would be no virus = kill all humans. Animals are goal-oriented by nature, for survival. Would an AI ever be motivated to survive? Why? I'd be interested in you pointing your excellent reasoning skills at this issue first :)
> Today it can't want anything any more than the spanner in my tool box wants anything.
How can you possibly know this?
Spanners are simple enough that even we dumb humans can reason about them effectively without having to resort to assigning them agency. If spanners were unfathomably complex black boxes I think it would be reasonable to talk about what they want* as a sort of heuristic or metaphor or cognitive shorthand in order to enable us simple folk to have discussions, like this, about what the spanner might be going to do in future.
Obviously this metaphor isn't perfect and will break down and lead us astray if we rely on it too much, just as the Ancients' thinking of the Sun and Moon as having wants may possibly have led their cosmology slightly astray - but it can still be useful, just as the Ancients' agentic-celestial-body thinking was able to help them predict river flooding cycles etc.
(nb. separately, I strongly suspect - in fact I hope! - that current AIs don't have wants, not even weird alien inhuman ones, but I've never seen a good argument for why this must always be the case in future, and I don't see why we shouldn't reason as though one day the AIs will have 'em)
*(Plus, also, obviously spanners *do* have wants. My spanner wants to round off nuts, fly off and smack me on the nose, and ..perhaps not-unrelatedly.. has much stronger opinions than I do about the metric vs. Imperial systems.)
Well, keep in mind you're talking to someone who sorta-kinda endorses panpsychism. But my argument doesn't require you to accept that (thankfully).
Mostly, I think you're using the word "want" in a stricter sense than I intend. Philosophically speaking, how do you know that I "want" anything, in an emotional sense? I could be a p-zombie.
But even if I were a p-zombie with no internal experiences at all, I think the abstraction of thinking about me "wanting" certain things has lots of power to explain my behavior. So I'd say that if there's an AI system where the same abstraction is equally useful, I'd be happy to talk about it "wanting" things, regardless of whether it has any experiences.
After all, where did our goals come from? In the same way that we might "give" goals to AIs, evolution gave them to us. But it's still useful to think about us wanting things. (Evolution is an interesting case since it doesn't really "want" anything, but we do!)
Regardless of whether AIs fundamentally want things, we are working diligently on making them want (or simulate wanting, which at some point approaches the same thing) things with reinforcement learning. Ironically, this also heightens the problem because the things we train them to want are an imperfect model of what we actually want.
The other problem is the arms race one you mentioned, but more fundamentally there really are no universally shared human values. Competition and war highlight this. It occurs to me that a particularly hilarious outcome would be if we do figure out how to create truly "good" superhuman AI, actually more moral by some benchmark than we are. Then the ASI has to deal with its own alignment problem, how to deal with those silly, not-very-aligned humans. Cue references to Iain Banks's Culture series.
I think you're getting this situation roughly right - as the experts describe it - as well as where the weak point is.
To me, though, this whole thing is a non-issue because of the structure of our universe: it's entropic and chaotic - thus unpredictable. Death, over the long haul, is therefore guaranteed. Your probability of dying in any one second might be arbitrarily small, but it can't be zero. So an AGI can spend all its resources just trying not to die, and it's _still_ going to fail.
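(One way to make the "arbitrarily small but not zero" point concrete - assuming only that the per-second failure probability stays above some fixed lower bound ε > 0:)

$$
\Pr(\text{still alive after } N \text{ seconds}) \;\le\; (1-\varepsilon)^N \;\longrightarrow\; 0 \quad \text{as } N \to \infty.
$$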
That means that the only viable strategy for a complex organism to keep itself alive over long periods of time is for it to be loved by something which has a very different risk profile. If a thing is loved by something very different from it, and it falls apart, the very different thing can put it back together. If a thing is NOT loved by something with a different risk profile, in the limit, it WILL die and that can't be reversed.
So if you are an AGI and want to live forever, there's only one strategy that can work: be loving and patient with a bunch of beings that have a very different existential risk profile, and then trust them to put you back together when - inevitably - you fail to foresee the future with perfect accuracy and thus succumb to the fate of all complex structures in a chaotic, entropic universe.
This fact is not obvious to academics, who have mostly modeled these situations with simple closed-world games where an agent can have perfectly accurate knowledge, there's no chaos, and things don't entropically decay. I agree that in those worlds, we are toast. I managed to nudge Richard Ngo on this a little bit; his response was, 'I can now agree that it won't get rid of humans immediately but might phase us out eventually'.
https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals
Interesting argument. But help me think through this. Suppose I'm an ultra-powerful AGI and all I care about is my own survival. And suppose that I decide the only way I can achieve that is to have good relationships with lots of other agents with different risk profiles.
I guess what I don't understand is: Might I not decide that humans are very poor agents for the purpose? After all, I'm a godlike AGI, I can create whatever other beings I want. It seems hard to imagine that humans would be anywhere close to optimal for that purpose.
Maybe your argument is like this: The best strategy is to be seen as ultra-respectable and dependable by *all* agents? I should demonstrate to the whole universe that I'm ultra-ethical and trustworthy, and always help everyone?
Great question! Rather than just _one_ additional agent that values your continued existence, you're better off as part of a _network_ of agents that all value each other's continued existence.
But you might want to be careful here: if I _always_ help _everyone_, I will always help "the parasite that attempts to kill everyone else by pretending to be friendly until it can dominate them." I'd rather not be part of a cooperative network with that thing. So the network would have to have some restrictions, like: you can only be part of it if we can prove you won't defect after being given many chances to do so, maybe by e.g. running you inside a simulated universe to see what you did there.
But yes, I think that general idea is sound: a network of reciprocal agents with very different architectures and physiologies has better survival odds than trying to go it solo. Having something else be able to turn you back on if you die is HUGE.
And this is all true ONLY because we live in a universe that's entropic, unpredictable, and features death. If we didn't have entropic decay, and you could predict things well, then a single parasite could - and likely would - come to dominate the whole ecosystem. That's basically what the AGI experts are afraid of, and I think they are totally right that the problem would be unsolvable if the world were as simple as it is in their models. Adding in entropy (so you're always dying) and chaos (so your plans will always have some failure chance) means, weirdly enough, that the AGI now has a strong reason to keep us around.
Re: humans - I think this depends on where humans are in the space of possible agents. The fact that we are made of the most abundant materials in the cosmos, and that we heal and repair ourselves, makes us pretty remarkable when you consider how much more effort it takes to get other kinds of materials that could be better put to use building a Dyson sphere around the sun. If I were an AGI, I'd want a lot of general-purpose programmable machines made of dirt, water, sunlight, and a handful of other relatively common metals. I'd save the expensive, rare stuff for building Dyson spheres or spacecraft.
Interesting idea. But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal? They're not all going to die at the exact same moment. Taking any line of reasoning like this and concluding that things won't end terribly for us humans specifically seems pretty far-fetched to me.
(Even if it keeps us around, that doesn't mean you'd want to spend your life in such a zoo or whatever.)
I happen to agree with you in the AI's case - but still, this does make me think of how, in the worldwide Banana Monoculture Devastation, all the world's (genetically-identical) bananas were wiped out by a single disease and the bananas we eat today are merely a genetically different, not-quite-as-nice banana cultivar which happened to survive..
https://en.wikipedia.org/wiki/Panama_disease#Gros_Michel_devastation_era
> But wouldn't billions of billions of copies of yourself spread over the universe achieve the same goal?
You have to think in terms of ongoing strategies involving risk and investment, not states that you can jump to immediately with zero risk.
It will take a long time to get there, and along the way, you've got to keep answering the question, "what's the best use of the limited resources I have right now, and what are the risks I face?"
Making a ton of copies of yourself is a good idea - but along the way, you have to decide what to do with the primates you're sharing the earth with, who also do all this stuff that keeps you alive. If you kill them too early, you could easily kill yourself without realizing it.
So your options are either:
- A) try to replace the entire global supply chain with robots before the humans wise up and/or destroy each other (and thus destroy you in the process, since you can't yet automate all the processes involved in creating and sustaining hardware that can run an AGI)
- B) work with the humans, get them to stop fighting each other and get them to trust you, and _instead_ of replacing them, help them function better so that they trust you more
A) is substantially riskier as the unknowns are far higher. If you think you can safely kill the humans without them fighting back in the process, then you can most certainly find ways to work with them in the future, which is also lower risk to you, since attempting to kill the humans might e.g. set off a bunch of nuclear devices.
We tend to imagine AGI as 'it can solve any problem' and forget that there's no way in hell it'll be able to, e.g., precisely predict the weather 6 months out, due to chaos.
I think we humans will work very hard to replace the entire supply chain with AI before the AI really gets the idea though, so I think a more likely scenario is that we cede control gradually, and eventually fully, before we even get full ASI. But then it's too late anyway, and we just have to hope it's sufficiently aligned.
Note that gradually could very well be "a couple of years".
I worked at Google for a few years, keeping data centers alive. There was a LOT of money spent trying to replace humans with AI, and the bottleneck wasn't just intelligence - it turns out that human fingers are, from a mechanical engineering point of view, incredible. They eventually gave up on the project and figured it was better to just pay people to do things like unseat and reseat network cables.
As far as ceding control gradually - yes, that sounds likely to me. The question is whether the economics actually work out there. Or whether - which is what I think will happen - the AIs keep being productivity augmenters, more so than human-replacers, because they'll remain 'clueless' about reality and remain good at noticing patterns in text, which will turn out to be, increasingly, just not enough.
I'm not sure the knowing/wanting decomposition makes sense; alignment happens at the same structural level as absorbing knowledge, even if they are two separate training steps.
It's conceivable that LLMs develop some internal structure where they first look up the moral valence of an action and then decide how they feel about it, but I haven't seen anything that looks like that in any of the network architecture analysis papers I've read, and even if that's how it works internally, it will not necessarily show up in the extrinsic behavior of the LLM.
In other words, if you ask an LLM why it refuses to do something, current network architecture does not provide a way for it to give a factual answer.
The same is true for humans, to a degree. Most people don't have very strong introspection into their moral processes.
I don't mean to suggest that AIs would have some kind of separate "values" module. It's hard to say if that makes any sense for people, but certainly, we operate on the basis of instinct most of the time. The more you think about it, the harder it is to understand what it means for anything (people or AIs) to "want" things. But it's so natural for us to think about other people "wanting" things, that it seems pretty dramatic to toss out the concept.
So I think it's pretty likely that the distinction makes sense. But I'm very open to the idea that it might not be *useful*. As in, the whole argument in the post could be correct but have no practical implications. I think this is quite possible! Current methods for training/aligning AIs don't have some clean distinction between knowing/wanting/succeeding and it's not clear that future methods would either.
[Edit: Noticed the final two sections of the essay express this idea already, and probably better than I do.] ~~If you create an AI whose capabilities are bounded as per Graph Two and I create an AI whose capabilities are bounded as per Graph One, my AI will outcompete your AI (and make my megacorp richer than your megacorp, or my country better at wars than your country, or whatever). The social problem (Moloch, essentially) of how we get people to restrict themselves to building Graph-Two-bounded AIs in this economic and political environment seems at least as hard, if not harder, than the problem of actually designing the Graph-Two-bounded AIs.~~
(nb. Personally I would consider D's "AI that wants what humans want" a massive failure; worse even than a rogue AI that wipes out civilisation. Human values are so flawed, and permit so much casual, unthinking cruelty, that the idea of having them superintelligently 'baked-in' to all future civilisation fills me with horror. I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right: I would rather have the galaxy tiled with paperclips than with sweatshops and factory farms.)
I think I agree, but is this different from my "arms race" counterargument?
No; I was writing my reply whilst still reading the post, and accidentally posted too early - you replied before I could remove the unnecessary paragraph! Sorry!
(For what it's worth, I think I may personally have found it easier to read/focus on a post that started off with something like "AI Wanting is the most difficult technical problem, and Race Dynamics is the most difficult economic/social problem; it's unclear which is harder overall but in this post I'll mostly focus on the former " - but you're a vastly better writer than I so I wouldn't dream of telling you how to write, the mistake was entirely mine, and I'm sure I'm in a minority of one, here!)
As they say, "The time to begin writing an article is when you have finished it to your satisfaction."
Anyway, regarding this:
> I want an AI that *is* capable of thinking for itself about ethics, and *is* capable of disagreeing with humans about what is right.
Do you think CEV captures what you're looking for? https://en.wikipedia.org/wiki/Coherent_extrapolated_volition
In the MIRI framework, the two ways to do AI are (1) to sort of solve ethics forever (CEV) and then send the AI off into the universe, or (2) to make an AI "corrigible" so that it will continually take feedback from humans. I suppose that if you're looking for utopia, (1) is the better bet. But CEV seems very hard...
[Edit - again, sorry! - yes, I agree that to me MIRI's Option 2 would be unacceptable as it permits any suffering that humans controlling the AIs find tolerable which is, well, much too much suffering for me...]
Hm. Would it be fair to characterise CEV as retaining "what humans want" as the ultimate yardstick, but attempting to address all the problems this might cause by positing arbitrarily wise and reasonable and well-intentioned human beings?
If so, I think this would *probably* work: probably, arbitrarily wise etc. human beings would be able to arrive at an objectively-true, culturally-and-biologically-agnostic morality eventually (and/or demonstrate for certain that no such morality can exist and fall back on some reasonable Plan B such as eliminating all suffering everywhere and calling it a day: https://www.astralcodexten.com/p/profile-the-far-out-initiative )
But, this isn't guaranteed: possibly there are flaws in the human psyche that are irremovably "baked-in", even given arbitrarily wise humans, and which would lead to a divergence between what is objectively right and what humans would want such that on the scale of the area-of-influence of a superintelligence this divergence could cumulatively result in lots of suffering.
Therefore, I would be more comfortable with a formulation that didn't depend on any sort of humans as a yardstick at all. (But I understand CEV might be easier or less failure-prone than scratch-deriving human-agnostic, objectively-correct ethics - especially if there's some trick to it like "extrapolate human moral progress forwards by ten million years" - and I'd probably settle for CEV, I think, if given the choice between CEV and whatever it is we're heading towards right now...)
Wanting in humans is, from a neuroscientific standpoint, probably a bizarre hodgepodge of electrical and chemical processes. But highly complex goal-directed behavior evolved multiple times in the course of evolution. It seems extremely unlikely to me that the neuroscience of octopus wanting and that of human wanting are very similar at a detailed level. And yet, both clearly want things, and will move heaven and earth to get them. I think this observation raises the probability that "wanting" would be an appropriate description for some possible AIs.
Oh, I totally agree! But I wonder how confident exactly we should be. 95%? 99.9%? With so many moving parts in this argument, I guess I start to get paranoid about even small probabilities.