They state that their estimated probability for each event is conditional on all previous events happening.
I think this is an excellent, well-researched contribution and am confused about why it's not being upvoted more (on LW that is; it seems to be doing much better on EAF, interestingly).
I see, that makes sense. I agree that holding all else constant more neurons implies higher intelligence.
Within a particular genus or architecture, more neurons would be higher intelligence.
I'm not sure that's necessarily true? Though there's probably a correlation. See e.g. this post:
[T]he raw number of neurons an organism possesses does not tell the full story about information processing capacity. That’s because the number of computations that can be performed over a given amount of time in a brain also depends upon many other factors, such as (1) the number of connections between neurons, (2) the distance between neurons (with shorter distances allowing faster communication), (3) the conduction velocity of neurons, and (4) the refractory period which indicates how much time must elapse before a given neuron can fire again. In some ways, these additional factors can actually favor smaller brains (Chitka 2009).
Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
I'm slightly confused by this. It sounds like "(1) ML systems will do X because X will be rewarded according to the objective, and (2) X will be rewarded according to the objective because being rewarded will accomplish X". But (2) sounds circular -- I see that performing well on the training objective gives influence, but I would've thought only effects (direct and indirect) on the objective are relevant in determining which behaviors ML systems pick up, not effects on obtaining influence.
Maybe that's the intended meaning -- I'm just misreading this passage, but also maybe I'm missing some deeper point here?
Terrific post, by the way, still now four years later.
I vaguely remember OpenAI citing US law as a reason they don't allow Chinese users access, maybe legislation passed as part of the chip ban?
Nah, the export controls don't cover this sort of thing. They just cover chips, devices that contain chips (i.e. GPUs and AI ASICs), and equipment/materials/software/information used to make those. (I don't know the actual reason for OpenAI's not allowing Chinese customers, though.)
If only we could spread the meme of irresponsible Western powers charging head-first into building AGI without thinking through the consequences and how wise the Chinese regulation is in contrast.
That sort of strategy seems like it could easily backfire, where people only pick up the first part of that statement ("irresponsible Western powers charging head-first into building AGI") and think "oh, that means we need to speed up". Or maybe that's what you mean by "if only" -- that it's hard to spread even weakly nuanced messages?
Thanks, this analysis makes a lot of sense to me. Some random thoughts:
I'm really confused by this passage from The Six Mistakes Executives Make in Risk Management (Taleb, Goldstein, Spitznagel):
We asked participants in an experiment: “You are on vacation in a foreign country and are considering flying a local airline to see a special island. Safety statistics show that, on average, there has been one crash every 1,000 years on this airline. It is unlikely you’ll visit this part of the world again. Would you take the flight?” All the respondents said they would.
We then changed the second sentence so it read: “Safety statistics show that, on average, one in 1,000 flights on this airline has crashed.” Only 70% of the sample said they would take the flight. In both cases, the chance of a crash is 1 in 1,000; the latter formulation simply sounds more risky.
One crash every 1,000 years is only the same as one crash in 1,000 flights if there's exactly one flight per year on average. I guess they must have stipulated that in the experiment (of which there's no citation), because otherwise it's perfectly rational to suppose the first option is safer (since generally an airline serves >1 flight per year)?
I’m confused about the parallelization part and what it implies. It says the model was trained on 2K GPUs but GPT-4 was probably trained on 1 OOM more than that right?