I dislike the framing of this post. Reading it gave me the impression that:

  • You wrote a post with a big prediction ("AI will know about safety plans posted on the internet")
  • Your post was controversial and did not receive a lot of net-upvotes
  • Comments that disagree with you received a lot of upvotes. Here you made me think that these upvoted comments disagree with the above prediction.

But actually reading the original post and the comments reveals a different picture:

  • The "prediction" was not a prominent part of your post.
  • The comments, such as this imo excellent comment, did not disagree with the "prediction", but with other aspects of your post.

Overall, I think it's highly likely that the downvotes were not because people did not believe that future AI systems will know about safety plans posted on LW/EAF, but for other reasons. I think people were well aware that AI systems will get to know about plans for AI safety, just as I think it is very likely that this comment itself will be found in the training data of future AI systems.

Yes, good catch. I edited. I made two mistakes in the above:

  • I confused personal money with "altruistic money": at the beginning of the comment I assumed that all money would be donated and none kept. By the end of my comment, my mental model had apparently shifted to also include personal money/"selfish money" (which it would be justified for people to keep).

  • I included a range of numbers for the possible bet size, and thought that lower bet amounts would be justified due to diminishing returns. Checking the numbers again, the diminishing returns are not that significant (at the scale of $1B likely far below 10x), and my opinion is now that you should bet everything.

An assumption that seems to be present in the betting framework here is that you frequently encounter bets which have positive EV.

I think that in real life this assumption is not particularly realistic. Most people do not encounter many opportunities whose EV (in money) is significantly above that of ordinary options such as investing in the stock market.

Suppose you have $100k and are in a situation where you only win 10% of the time, but if you do, you get paid out 10,000x your bet size. Suppose further that after the bet you do not expect to find similar opportunities again, and you plan to donate everything to GiveDirectly. If you were to rank-optimize, which, iiuc, means maximizing the probability of being "the richest person in the room", then you should bet nothing, because then you have a 90% probability of being richer than the counterfactual-you who bets a fraction of their wealth. But if you care a lot about the value your donations provide to the world, then you should probably bet $40k-$100k (depending on the diminishing returns of money to GiveDirectly, or maybe valuing having a bit of money for selfish reasons). edit: But if you care a lot about the value your donations provide to the world, then you should probably bet all $100k (there are likely diminishing returns of money given to GiveDirectly, but I think the high upside of the bet outweighs the diminishment by a big margin. Also, by assumption of this thought experiment, you were not planning to keep any money for selfish purposes.)
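To make the numbers concrete, here is a minimal Python sketch of the thought experiment above, assuming (as in the edit) that the value of donations is roughly linear in money:

```python
# Thought experiment above: $100k bankroll, a one-off bet that wins 10% of
# the time and pays out 10,000x the stake. Assumes donation value is roughly
# linear in money (i.e. diminishing returns are ignored).

bankroll = 100_000
p_win, payout_multiple = 0.10, 10_000

def expected_wealth(stake: float) -> float:
    """Expected wealth after the bet, for a given stake."""
    win = bankroll - stake + payout_multiple * stake
    lose = bankroll - stake
    return p_win * win + (1 - p_win) * lose

for stake in (0, 40_000, 100_000):
    print(f"stake ${stake:>7,}: expected wealth ${expected_wealth(stake):>11,.0f}")

# stake $      0: expected wealth $    100,000
# stake $ 40,000: expected wealth $ 40,060,000
# stake $100,000: expected wealth $100,000,000
#
# Expected wealth grows linearly in the stake, so betting everything
# maximizes expected donations. A rank optimizer bets nothing: any positive
# stake leaves a 90% chance of ending up poorer than the non-bettor.
```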

This just feels like pretend, made-up research that they put math equations in to seem like it's formal and rigorous.

Can you elaborate on which parts feel made up to you? E.g.:

  • modelling a superintelligent agent as a utility maximizer
  • considering a 3-step toy model
  • assuming that a specification of the utility function exists

At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.

The authors do not claim to have solved the problem; instead, they state that this is an open problem. So it is not surprising that there is no satisfying answer.

I would also like to note that the paper has many more caveats.

Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent or cause the pressing of the shutdown button)?

I fail to see how this changes the answer to the St Petersburg paradox. We have one option of utility 2 with 51% probability and utility 0 with 49% probability, and a second option of utility 1 with 100% probability. Removing the worst 0.5% of the probability distribution gives us a probability of 48.5% for utility 0, and removing the best 0.5% of the probability distribution gives us a probability of 50.5% for utility 2. Renormalizing so that the probabilities sum to 1 gives us a probability of 48.5/99 ≈ 0.49 for utility 0 and 50.5/99 ≈ 0.51 for utility 2. The expected value is then ≈ 1.02, which is still greater than 1. Thus we should choose the option where we have a chance at doubling utility.
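For concreteness, the same arithmetic in a few lines of Python:

```python
# Option A: utility 2 with 51%, utility 0 with 49%. Option B: utility 1.
# Trim the best 0.5% and the worst 0.5% of option A's distribution,
# renormalize, and recompute the expected utility.

trim = 0.005
trimmed = [(2.0, 0.51 - trim), (0.0, 0.49 - trim)]  # (utility, probability)

total = sum(p for _, p in trimmed)                  # 0.99
renormalized = [(u, p / total) for u, p in trimmed]
expected_value = sum(u * p for u, p in renormalized)

print(renormalized)    # [(2.0, 0.5101...), (0.0, 0.4898...)]
print(expected_value)  # 1.0202... > 1, so option A is still preferred
```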

P(misalignment x-risk | AGI that understands democratic law) < P(misalignment x-risk | AGI)

I don't think this is particularly compelling. While technically true, the difference between those probabilities is tiny. Any AGI is highly likely to understand democratic laws.

Summary of your argument: The training data can contain outputs of processes that have superhuman abilities (e.g. chess engines), therefore LLMs can exceed human performance.

More speculatively, there might be another source of (slight?) superhuman abilities: GPT-N could generalize/extrapolate from human abilities to superhuman abilities, if it were plausible that at some point in the future these superhuman abilities would be shown on the internet. For example, it is conceivable that GPT-N prompted with "Here is a proof of the Riemann hypothesis that has been verified extensively:" would actually produce a valid proof, even if a proof of the Riemann hypothesis were beyond the abilities of the humans in the training data.

But perhaps this is an assumption people often make about LLMs.

I think people often claim something along the lines of "GPT-8 cannot exceed human capacity" (which is technically false) to argue that a (naively) upscaled version of GPT-3 cannot reach AGI. I think we should expect that there are at least some limits to the intelligence we can obtain from GPT-8, if it is just trained to predict text (without any amplification steps or RL).

Because it was not trained using reinforcement learning and doesn't have a utility function, which means that it won't face problems like mesa-optimisation

I think this is at least a non-obvious claim. In principle, it is conceivable that mesa-optimisation can occur outside of RL. There could be an agent/optimizer in (highly advanced, future) predictive models, even if the system does not really have a base objective. In this case, it might be better to think in terms of training stories rather than inner+outer alignment. Furthermore, there could still be issues with gradient hacking.

Great post! I agree that academia is a resource that could be very useful for AI safety.

There are a lot of misunderstandings around AI safety, and I think the AIS community failed to properly explain the core ideas to academics until fairly recently. As a result, I often encountered confusions such as the belief that AI safety is about fairness, self-driving cars, and medical ML.

I think these misunderstandings are understandable given the term "AI safety". Maybe it would be better to call the field AGI safety or AGI alignment? That seems to me like a more honest description of the field.

You also write that you find it easier to not talk about xrisk. If we avoid talking about xrisk while presenting AI safety, then some misunderstandings about AI safety will likely persist in the future.

(Copied partially from here)

My intuition is that PreDCA falls short on the "extrapolated" part of "coherent extrapolated volition". PreDCA would extract a utility function from the flawed algorithm implemented by a human brain. This utility function would be coherent, but might not be extrapolated: the extrapolated utility function (i.e. what humans would value if they were much smarter) is probably more complicated to formulate than the un-extrapolated utility function.

For example, the policy implemented by an average human brain probably contributes more to total human happiness than most other policies. Let's say U_1 is a utility function that values human happiness as measured by certain chemical states in the brain, and U_2 is "extrapolated happiness" (where "putting all human brains in vats to make them feel happy" would not be good for U_2). Then it is plausible that U_1 is simpler to specify than U_2. But the policy implemented by an average human brain would do approximately equally well on both utility functions. Thus, PreDCA would plausibly extract U_1 (un-extrapolated happiness) rather than U_2.
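To illustrate the intuition, here is a toy sketch; this is not PreDCA's actual inference rule, and the fit numbers and complexities below are made up, but it shows how a simplicity-weighted fit to the observed policy would favor U_1:

```python
# Toy illustration (not PreDCA's actual formula): score each candidate
# utility function by how well the observed human policy optimizes it,
# weighted by a simplicity prior. All numbers below are invented.

candidates = {
    # name: (fit of the average-human policy to this utility,
    #        description complexity in bits)
    "U1_chemical_happiness":     (0.80, 10),
    "U2_extrapolated_happiness": (0.82, 50),  # slightly better fit, but more complex
}

def score(fit: float, complexity_bits: int) -> float:
    """Fit to the observed policy, weighted by a 2**-complexity simplicity prior."""
    return fit * 2.0 ** (-complexity_bits)

best = max(candidates, key=lambda name: score(*candidates[name]))
print(best)  # "U1_chemical_happiness": the simpler utility wins, even though
             # the policy fits U2 slightly better.
```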
