silent-observer
silent-observer has not written any posts yet.

silent-observer has not written any posts yet.

I would like to object to the variance explanation: in the Everett interpretation there was not even one collapse since the Big Bang. That means that every single quantum-ly random event from the start of the universe is already accounted in the variance. Over such timescales variance easily covers basically anything allowed by the laws: universes where humans exist, universes where they don't, universes where humans exist but the Earth is shifted 1 meter to the right, universes where the start of Unix timestamp is defined to start in 1960 and not 1970, because some cosmic ray hit the brain of some engineer at exactly the right time, and certainly universes like ours but you pressed the "start training" button 0.153 seconds later. The variance doesn't have to stem from how brains are affected by quanum fluctuations now, it can also stem from how brains are affected by regular macroscopical external stimuli that resulted from quantum fluctuations that happened billions of years ago.
I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that won't include stuff like ants?), and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial amount of agents goes rogue, and some of them get punished, but some get through). I still think we can do better than this though, somehow.
(Also... (read more)
First of all, I do agree with your premises and the stated values, except for the assumption that "technically superior alien race" would be safe to create. If such an AI would have its own values other than helping humans/other AIs/whatever, then I'm not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that in its training environment could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than trying to contain it against its will, but I don't think we should go with the "autonomous being"... (read 545 more words →)
Ah, true. I just think this wouldn't be enough and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post, it looks a lot better
So, no "reading" minds, just looking at behaviours? Sorry, I misundertood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?
Ah, I see. But how would they actually escape the deception arms race? The agents still need some system of detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I just can't see any other outcome other than agents evolving exceedingly more complicated ways to detect if someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how that is useful for aligning the agents. Aren't we supposed to make them not even want to deceive others, instead of trying to find a deception strategy and failing? (Also, I think even an average human isn't that well aligned as we want our AIs to be. You wouldn't want to give a random guy from the street nuclear codes, would you?)
it's the agent's job to convince other agents based on its behavior
So agents are rewarded for doing stuff that convinces others that they're a "happy AI", not necessarily actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?
Like, suppose you start with a population of "happy AIs" that cooperate with each other, then if one of them realizes there's a new way to deceive the others, there's nothing to stop them until other agents adapt to this new kind of deception and learn to detect it? That feels like training an inherently unsafe and deceptive AI that also are extremely suspicious of others, not something "happy" and "friendly"
Except the point of Yudkowsky's "friendly AI" is that they don't have freedom to pick their own goals, they have the goals we set to them, and they are (supposedly) safe in a sense that "wiping out humanity" is not something we want, therefore it's not something an aligned AI would want. We don't replicate evolution with AIs, we replicate careful design and engineering that humans have used for literally everything else. If there is only a handful of powerful AIs with careful restrictions on what their goals can be (something we don't know how to do yet), then your scenario won't happen
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs will even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not? I do agree though that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.
This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of each other? I don't think modern neural nets can do that yet. And if... (read more)
I think the application to the Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of "parallel worlds" (not Everett branches, just thinking of different planets very far away is enough) each fighting the Dust. Every fight each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived the encounter with the Dust 1079 times before you. The world next to it has already survived 2181, and the other one was destroyed during the 129th attempt.
This hypothesis explains the observation of the hero pretty well -... (read more)