I'm sorry but I don't get the explanation regarding the coinrun. I claim that the "reward as incentivization" framing still "explains" the behaviour in this case. As an analogy, we can go back to training a dog and rewarding it with biscuits: let's say you write numbers on the floor from 1 to 10. You ask the dog a simple calculus question (whose answer is between 1 to 10), and each time he puts its paw on the right number he gets a biscuit. Let's just say that during the training it so happens that the answer to all the calculus questions is always 6. Would you claim that you taught the dog to answer simple calculus questions, or rather that you taught it to put his paw on 6 when you ask him a calculus question? If the answer is the latter then I don't get why the interpretation through the "reward as incentivization" framing in the CoinRun setting is that the model "wants to get the coin" in the CoinRun.
The generalized version of this lesson "that cooperation/collusion favors the good guys - ie those aligned towards humanity" actually plays out in history. In WW2 the democratic powers - those with interconnected economies and governments more aligned to their people - formed the stronger allied coalition. The remaining autocratic powers - all less aligned to their people and also each other - formed a coalition of necessity. Today history simply repeats itself with the democratic world aligned against the main autocratic powers (russia, china, north korea, iran).
I don't want to enter a history debate, but I'm not at all sold on that view, which seems to rewrite history. The european part of WW2 was mainly won because of the USSR, not really a "democratic power" (you could argue that USSR would never have had the means to do that without the financial help of the US, or that without England holding up, Germany would have won on the eastern front, both of which are probably true, but the point still stands that it's not as simple as "democratic vs autocratic").
Regarding the present, I'm not sold at all on the "democratic world aligned against the main autocratic powers". Actually, I'd even make the case that democratic powers actively cooperate with autocratic ones as long as they have something to gain, despite it being contrary to the values they advocate for: child labor in asian coutries, women's rights in Emirates, Qatar, Saudi Arabia, and so on. So I believe that once we look at a more detailed picture than the one you're depicting it's actually a counterargument to your take.
I don't actually think we're bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it's useful to remember what that paper actually told the rest of the field: "hey you can get way better results for way less compute if you do it this way."I feel like characterizing Chinchilla most directly as a bottleneck would be missing its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we have no reason to believe that they are insurmountable. In fact, it looks an awful lot like it won't even be very difficult!
I don't actually think we're bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it's useful to remember what that paper actually told the rest of the field: "hey you can get way better results for way less compute if you do it this way."
I feel like characterizing Chinchilla most directly as a bottleneck would be missing its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we have no reason to believe that they are insurmountable. In fact, it looks an awful lot like it won't even be very difficult!
Could you explain why you feel that way about Chinchilla? Because I found that post: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications to give very compelling reasons for why data should be considered a bottleneck and I'm curious what makes you say that it shouldn't be a problem at all.
I'd very much like to understand how your credences can be so high with nothing else to back them up than "it's possible and we lack some data". Like, sure, but to have credences so high you need to have at least some data or reason to back that up.
Humans have not evolved to do math or physics, but we did evolve to resist manipulation and deception, these were commonplace in the ancestral environment.
This seems pretty counterintuitive to me, seeing how easily many humans fall for not-so-subtle deception and manipulation everyday.
I really don't understand the AGI in a box part of your arguments: as long as you want your AGI to actually do something (it can be anything, be it you asked for a proof of a mathematical problem or whatever else), its output will have to go through a human anyway, which is basically the moment when your AGI escapes. It does not matter what kind of box you put around your AGI because you always have to open it for the AGI to do what you want it to do.
The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you're trying to cause not X, and generally because an AI that smart probably has inner optimizers that don't care about this "make a plan, don't execute plans" thing you thought you'd set up.
I believe the second case is a subcase of the problem of ELK. Maybe the AI isn't trying to deceive you, and actually do what you asked it to do (e.g., I want to see "the diamond" on the main detector), yet the plans it produces has consequence X that you don't want (in the ELK example, the diamond is stolen but you see something that looks like the diamond on the main detector). The problem is: how can you be sure the plans proposed have consequence X? Especially if you don't even know X is a possible consequence of the plans?
Why would I press the dislike button when I get the possibility to signal virtue by showing people I condemn what "X" says about "Y"?
Talking about the fact that each consciouness will only see a classical state doesn't make sense, because they are in a quantum superposition state. Just like it does not make sense to say that the photon went either right or left in the double slit experiment.
You created a superposition of a million consciousnesses and then outputted an aggregate value about all those consciousnesses.
This I agree with.
Either a million entities experienced a conscious experience, or you can find out the output of a conscious being with ever actually creating a conscious being - i.e. p-zombies exist (or at least aggregated p-zombies).
This I do not. You do not get access to a million entities by the argument I laid out previously. You did not simulate all of them. And you did not create something that behaves like a million entities aggregated either, just like you cannot store 2^n classical bits on a quantum computer consisting of n qbits. You get a function which outputs an aggregated value of your superposition, but you can't recover each consciousness you pretend to have been simulated from it. Therefore this is what I believe is flawed in your position:
it seems reasonable to extend that to something which act exactly the same as an aggregate of conscious beings - it must in fact be an aggregate of conscious beings
If I understand your arguments correctly (which I may not, in which case I'll be happy to stand corrected), this sentence should mean to you that for something to act the same as an aggregate of n conscious beings, it must be an aggregate of at least n conscious beings? But then doesn't this view mean that a function of d variables can never be reduced to a function of k variables, k < d?