# All of Leo P.'s Comments + Replies

I'm sorry but I don't get the explanation regarding CoinRun. I claim that the "reward as incentivization" framing still "explains" the behaviour in this case. As an analogy, we can go back to training a dog and rewarding it with biscuits: let's say you write numbers on the floor from 1 to 10. You ask the dog a simple calculus question (whose answer is between 1 and 10), and each time it puts its paw on the right number it gets a biscuit. Let's just say that during the training it so happens that the answer to all the calculus questions is always 6. Woul...

2DragonGod3mo
Strong agree and up vote. The issue is simply that the training did not uniquely constrain the designer's intended objectives, and that's independent of whether the training was incentivisation or selection.

The generalized version of this lesson "that cooperation/collusion favors the good guys - i.e. those aligned towards humanity" actually plays out in history. In WW2 the democratic powers - those with interconnected economies and governments more aligned to their people - formed the stronger allied coalition. The remaining autocratic powers - all less aligned to their people and also each other - formed a coalition of necessity. Today history simply repeats itself with the democratic world aligned against the main autocratic powers (Russia, China, North Korea

...
2jacob_cannell5mo
The USSR originally sided with the Axis, then switched to the Allies, but yes, obviously the coalitions are not pure. "democratic world aligned against the main autocratic powers" obviously doesn't mean that democratic powers don't also cooperate or trade with autocratic powers - obviously we still have extensive trade with China, for example. I meant "aligned against" in a larger strategic sense. Clearly we are at near proxy war with Russia already, and we have recently taken steps to try and cripple China's long term strategic - and especially - military power with the recent foundry-targeting embargoes.

I don't actually think we're bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it's useful to remember what that paper actually told the rest of the field: "hey you can get way better results for way less compute if you do it this way."

I feel like characterizing Chinchilla most directly as a bottleneck would be missing its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we

...
6porby6mo
Some of my confidence here arises from things that I don't think would be wise to blab about in public, so my arguments might not be quite as convincing sounding as I'd like, but I'll give a try.

I wouldn't quite say it's not a problem at all, but rather it's the type of problem that the field is really good at solving. They don't have to solve ethics or something. They just need to do some clever engineering with the backing of infinite money. I'd put it at a similar tier of difficulty as scaling up transformers to begin with. That wasn't nothing! And the industry blew straight through it.

To give some examples that I'm comfortable having in public:

1. Suppose you stick to text-only training. Could you expand your training sets automatically? Maybe create a higher quality transcription AI [https://openai.com/blog/whisper/] and use it to pad your training set using the entirety of youtube [https://twitter.com/ethanCaballero/status/1572692314400628739]?
2. Maybe you figure out a relatively simple way to extract more juice from a smaller dataset that doesn't collapse into pathological overfitting.
3. Maybe you make existing datasets more informative by filtering out sequences that seem to interfere with training.
4. Maybe you embrace multimodal training where text-only bottlenecks are irrelevant.
5. Maybe you do it the hard way. What's a few billion dollars?

I'd very much like to understand how your credences can be so high with nothing to back them up other than "it's possible and we lack some data". Like, sure, but to have credences that high you need at least some data or reasoning to back them up.

4DirectedEvolution9mo
When I wrote the comment, it was mainly because of the prevalence of subacute hypothyroidism in the general population. However, one of Natalia’s studies that I hadn’t been focusing on persuaded me that variations in dietary lithium intake are unlikely to be responsible for so much weight gain. Serum lithium isn’t associated with BMI.

Humans have not evolved to do math or physics, but we did evolve to resist manipulation and deception, these were commonplace in the ancestral environment.

This seems pretty counterintuitive to me, seeing how easily many humans fall for not-so-subtle deception and manipulation every day.

1jsnider31mo
Yes, the average human is dangerously easy to manipulate, but imagine how bad the situation would be if they didn't spend a hundred thousand years evolving to not be easily manipulated.

I really don't understand the AGI-in-a-box part of your arguments: as long as you want your AGI to actually do something (it can be anything, whether you asked for a proof of a mathematical problem or whatever else), its output will have to go through a human anyway, which is basically the moment when your AGI escapes. It does not matter what kind of box you put around your AGI because you always have to open it for the AGI to do what you want it to do.

4Rafael Harth10mo
I've found it helpful to think in terms of output channels rather than boxing. When you design an AI, you can choose freely how many output channels you give it; this is a spectrum. More output channels means less security but more capability. A few relevant places on the spectrum:

* No output channels at all (if you want to use the AI anyway, you have to do so with interpretability tools, this is Microscope AI [https://www.lesswrong.com/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety#Building_microscopes_not_agents])
* A binary output channel. This is an Oracle AI.
* A text box. GPT-3 has that.
* A robot body

While I don't have sources right now, afaik this problem has been analyzed at a decent level of depth, and no-one has found a point on the spectrum that really impresses in terms of capability and safety. A binary output channel is arguably already unsafe, and a text box definitely is. Microscope AI may be safe (though you could certainly debate that) but we are about ∞ far away from having interpretability tools that make it competitive.

The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you're trying to cause not X, and generally because an AI that smart probably has inner optimizers that don't care about this "make a plan, don't execute plans" thing you thought you'd set up.

I believe the second case is a subcase of the problem of ELK. Maybe the AI isn't trying to deceive you, and actually does what you asked it to do (e.g., I want to see "the diamond" on the main detector), yet the plans it produces...

Why would I press the dislike button when I get the possibility to signal virtue by showing people I condemn what "X" says about "Y"?

Talking about the fact that each consciousness will only see a classical state doesn't make sense, because they are in a quantum superposition state. Just like it does not make sense to say that the photon went either right or left in the double slit experiment.

You created a superposition of a million consciousnesses and then outputted an aggregate value about all those consciousnesses.

This I agree with.

Either a million entities experienced a conscious experience, or you can find out the output of a conscious being without ever actually creating a conscious being - i.e. p-zombies exist (or at least aggregated p-zombies).

This I do not. You do not get access to a million entities by the argument I laid out previously. You did not simulate all of them. And you did not create something that behaves like a million entiti...

I'm not sure I understand your answer. I'm saying that you did not simulate at least a million consciousnesses, just as in Shor's algorithm you do not try all the divisors.

You created a superposition of a million consciousnesses and then outputted an aggregate value about all those consciousnesses. Either a million entities experienced a conscious experience, or you can find out the output of a conscious being without ever actually creating a conscious being - i.e. p-zombies exist (or at least aggregated p-zombies).

So even though we've run the simulation function only a thousand times, we must have simulated at least a million consciousnesses, or how else could we know that exactly 254,368 of them e.g. output a message which doesn't contain the letter e?

Isn't that exactly what Scott Aaronson explains a quantum computer/algorithm doesn't do? (See https://www.scottaaronson.com/papers/philos.pdf, page 34 for the full explanation)

Sort of. It kind of is how it works, at least for some algorithms, but getting the output is tricky and requires interference. For our purposes the details don't matter.

For AIs, we're currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.

I'm sorry but I don't understand why looking at single AIs makes single humans the more appropriate reference class.

1Quintin Pope1y
I'm drawing an analogy between AI training and human learning. I don't think the process of training an AI via reinforcement learning is as different from human learning as many assume.

That seems like extremely limited, human thinking. If we're assuming a super powerful AGI, capable of wiping out humanity with high likelihood, it is also almost certainly capable of accomplishing its goals despite our theoretical attempts to stop it without needing to kill humans.

If humans are capable of building one AGI, they would certainly be capable of building a second one which could have goals unaligned with the first one.

1Marion Z.1y
I assume that any unrestrained AGI would pretty much immediately exert enough control over the mechanisms through which an AGI might take power (say, the internet, nanotech, whatever else it thinks of) to ensure that no other AI could do so without its permission. I suppose it is plausible that humanity is capable of threatening an AGI through the creation of another, but that seems rather unlikely in practice. First-mover advantage is incalculable to an AGI.

Even admitting that alignment is not possible, it's not clear why humanity's and super-AGI's goals should be in contrast, and not just different. Even admitting that they are highly likely to be in contrast, it is not clear why strategies to counter this cannot be effective (e.g. partnering up with a "good" super-AGI).

Because unchecked convergent instrumental goals for AGI are already in contrast with humanity's goals. As soon as you realize humanity may have reasons to want to shut down/restrain an AGI (through whatever means), this gives the AGI grounds to wipe out humanity.

1Marion Z.1y
That seems like extremely limited, human thinking. If we're assuming a super powerful AGI, capable of wiping out humanity with high likelihood, it is also almost certainly capable of accomplishing its goals despite our theoretical attempts to stop it without needing to kill humans. The issue, then, is not fully aligning AGI goals with human goals, but ensuring it has "don't wipe out humanity, don't cause extreme negative impacts to humanity" somewhere in its utility function. Probably doesn't even need to be weighted too strongly, if we're talking about a truly powerful AGI. Chimpanzees presumably don't want humans to rule the world - yet they have made no coherent effort to stop us from doing so, probably haven't even realized we are doing so, and even if they did we could pretty easily ignore it. "If something could get in the way (or even wants to get in the way, whether or not it is capable of trying) I need to wipe it out" is a sad, small mindset and I am entirely unconvinced that a significant portion of hypothetically likely AGIs would think this way. I think AGI will radically change the world, and maybe not for the better, but extinction seems like a hugely unlikely outcome.

Another funny thing you can do with square roots: let's take  and let us look at the power series . This converges for , so that you can specialize in . Now, you can also do that inside , and the series converges to , "the" square root of . But in  this actually converges to

But although Bayesianism makes the notion of knowledge less binary, it still relies too much on a binary notion of truth and falsehood. To elaborate, let’s focus on philosophy of science for a bit. Could someone give me a probability estimate that Darwin’s theory of evolution is true?

What do you mean by that question? Because the way I understand it, the probability is "zero". The probability that, in the vast hypothesis space, Darwin's theory of evolution is the one that's true, and not a slightly modified variant, is completely negligible. My main p...

Regarding the stopping rule issue, it really depends on how you decide when to stop. I believe sequential inference lets you do that without any problem, but it's not the same as saying that the p-value is within the wanted bounds. But basically all of this derives from working with p-values instead of workable values like log-odds. The other problem with p-values is that they only let you work with binary hypotheses and make you believe that writing things like P(H0) actually carries a meaning, when in reality you can't test a hypothesis in a vacuum, you have to...

5Younes Kamel1y
I'm not as versed in mistakes of meta-analysis yet, but I'm working on it! Once I compile enough meta-analysis misuses I will add them to the post. Here is one that's pretty interesting: https://crystalprisonzone.blogspot.com/2016/07/the-failure-of-fail-safe-n.html Many studies still use fail-safe N to account for publication bias when it has been shown to be invalid. If you see a study that uses it you can act as if they did not account for publication bias at all.
3TLW1y
One surprisingly good sanity check I've found is to do up a quick Monte Carlo sim in e.g. Python. As someone who uses statistics, but who is not a statistician, it's caught an astounding number of subtle issues.
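
As a rough illustration of that kind of sanity check, here is a minimal sketch applied to the stopping-rule question raised earlier in the thread. It is not anyone's actual analysis: the function names, the z-test with known unit variance, and the batch sizes are all assumptions chosen to keep the example short. The null hypothesis is true in every simulated experiment, so every rejection is a false positive.

```python
import math
import random


def two_sided_p_value(sample_sum: float, n: int) -> float:
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = sample_sum / math.sqrt(n)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_trials: int, max_n: int, peek_every: int,
                        alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of simulated null experiments that ever reach p < alpha.

    The experimenter checks the p-value after every `peek_every` observations
    and stops as soon as it drops below `alpha`. Setting peek_every == max_n
    gives the ordinary fixed-sample-size test (a single look at the end).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # data drawn with H0 true
            if n % peek_every == 0 and two_sided_p_value(total, n) < alpha:
                rejections += 1
                break  # optional stopping: quit at the first "significant" look
    return rejections / n_trials


if __name__ == "__main__":
    fixed = false_positive_rate(n_trials=2000, max_n=100, peek_every=100)
    peeking = false_positive_rate(n_trials=2000, max_n=100, peek_every=10)
    print(f"single look at n=100:   {fixed:.3f}")
    print(f"look every 10 samples:  {peeking:.3f}")
```

With a single look the rejection rate should land near the nominal 5%, while peeking every 10 samples pushes it well above that, which is the kind of subtle issue a quick sim makes hard to miss.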