Is anyone using the book as a funnel to LessWrong? I don't think MIRI are (afaik). The only event (again, afaik) going on in the UK is being jointly hosted by Pause AI and Control AI, with some other local community members helping out, which is not going to be a funnel for LW at all. I assume Lighthaven is doing something (I haven't checked), but are they going to say "If you like this book, you'll love our online forum"?
Moreover, is using LessWrong as the default funnel a good idea in the first place? I'd guess not. I know lots of people (notably Oliver Habryka) don't approve of Pause AI or Control AI, but I assume there must be other directions for suddenly-invigorated normies to be pointed in (though I've not actually looked for them).
That is true. "People are on hunger strikes and the CEOs haven't even commented" is (some) public evidence of "AI CEOs are unempathetic"
I misunderstood your point, I thought you were arguing against painting individuals as evil in general.
I don't really see the problem with painting people as evil in principle, given that some people are evil. You can argue against it in specific cases, but I think the case for AI CEOs being evil is strong enough that it can't be dismissed out of hand.
The case in question is "AI CEOs are optimising for their short-term status/profits, and for believing things about the world which maximise their comfort, rather than doing the due diligence required of someone in their position, which is to seriously check whether their company is building something which kills everyone"
Whether this is a useful frame for one's own thinking, or a good frame to deploy onto the public, I'm not fully sure, but I think it does need addressing. Of course it might also differ between CEOs. I think Demis and Dario are two of the CEOs to whom it's relatively less likely to apply, but I also don't think it applies weakly enough to be dismissed out of hand even in their cases.
Fair enough. I think these actions are +ev under a coarse-grained model where some version of "Attention on AI risk" is the main currency (or a slight refinement to "Not-totally-hostile attention on AI risk"). For a domain like public opinion and comms, I think that deploying a set of simple heuristics like "Am I getting attention?", "Is that attention generally positive?", and "Am I lying or doing something illegal?" can be pretty useful.
Michaël said on Twitter here that he's had conversations with two sympathetic DeepMind employees, plus David Silver, who was also vaguely sympathetic. This alone is already more +ev than I expected, so I'm updating in favour of Michaël here.
It's also occurred to me that if any of the CEOs cracks and at least publicly responds to the hunger strikers, then the CEOs who don't do so will look villainous, so you actually only need one of them to respond to get a wedge in.
I spoke to Michaël in person before he started. I told him I didn't think the game theory worked out (if he's not willing to die, GDM should ignore him; if he does die, he's worsening the world, since he can definitely contribute better by being alive, and GDM should still ignore him). I don't think he's going to starve himself to death or to the point of serious harm, but that does make the threat empty. I don't think that matters too much from a game-theoretic reputation standpoint, though, since nobody seems to be expecting him to go that far.
His theory of change was basically "If I do this, other people might" which seems to be true: he did get another person involved. That other person has said they'll do it for "1-3 weeks" which I would say is unambiguously not a threat to starve oneself to death.
As a publicity stunt it has kinda worked in the basic sense of getting publicity. I think it might change the texture and vibe of the AI protest movement in a direction I would prefer it to not go in. It certainly moves the salience-weighted average of public AI advocacy towards Stop AI-ish things.
I think that AI companies being governed (in general) is marginally better than them not being governed at all, but I also expect that the AI governance that occurs will look more like "AI companies have to pay X tax and heed Y planning system". That still leads to AI(s) eating ~100% of the economy while not being aligned to human values, and then the first coalition capable of killing off the rest and advancing its own aims (which might be a singleton AI, or might not be) will just do that, regulations be damned. I don't expect that humans will be part of the winning coalition that gets a stake in the future.
SpaceX doesn't run a country because rockets + rocket-building engineers + money cannot perform all the functions of labour, capital, and government, and there's no smooth pathway to them expanding that far. Increasing company scale is costly and often decreases efficiency; since they don't have a monopoly on force, they have to maintain cost efficiency and can't expand into all the functions of government.
An AGI has the important properties of labour, capital, and government (i.e. there's no "Lump of Labour", so it doesn't devalue the more of it there is; it can be produced at scale by more labour; and it can organize itself without external coordination or limitations). I expect any AGI which has these properties to very rapidly outscale all humans, regardless of starting conditions, since the AGI won't suffer from the same inefficiencies of scale or shortages of staff.
I don't expect AGIs to respect human laws and tax codes once they have the capability to just kill us.
I would be interested to know how you think things are going to go in the 95-99% of non-doom worlds. Do you expect AI to look like "ChatGPT but bigger, broader, and better" in the sense of being mostly abstracted and boxed away into individual use cases/situations? Do you expect AIs to be ~100% in command but just basically aligned and helpful?
I have now run some controls; the data has been added. Non-scatological data does not cause the same level of EM. One thing I did notice was that the scatological fine-tuning started with a loss of ~6 nats, while the control fine-tuning started at ~3 nats, and both went down to ~1 nat by the end. So the scatological data was, in some sense, a 2-3x larger delta to the model. I don't think this makes all the difference, but it does call into question whether this exact control is appropriate. When doing e.g. steering vectors, the appropriate control is a random vector of the same magnitude as the steering vector.
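For comparison, here's a minimal sketch of the matched-norm random control I mean in the steering-vector case (assuming a PyTorch setup with a precomputed steering vector; the function name is mine):

```python
import torch

def matched_norm_control(steering_vec: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Random vector with the same L2 norm as `steering_vec`, to use as a control."""
    gen = torch.Generator().manual_seed(seed)
    rand = torch.randn(steering_vec.shape, generator=gen)
    return rand * (steering_vec.norm() / rand.norm())

# Both get added to the residual stream at the same layer/position:
#   steered_resid = resid + steering_vec
#   control_resid = resid + matched_norm_control(steering_vec)
```

The analogous control for fine-tuning would presumably need to be matched on something like the initial loss delta, which the non-scatological data wasn't.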
RE part 6:
I think there's a more intuitive/abstract framing here. If a model has only seen e_2 with respect to two different facts, it probably won't have generated an abstraction for e_2 in its world model at all. An abstraction is mostly useful as a hub of different inferences, like in the old blegg/rube diagram.
Something which has come up in pretraining will already be an abstraction with an easy-to-reach-for handle that the model can pull.
Might be testable by fine-tuning on only some of (or some pairs of) the spokes of a blegg/rube diagram, to see whether the remaining (held-out) spoke-pairs fill in.
I.e.
"This object is round, so it's a blegg, so it's blue"
"This object is smooth, so it's a blegg, so it's round"
"This object is smooth, so it's a blegg, so it's bouncy"
"This object is round, is it bouncy?"
Something like that might cause "blegg" to be bound up and assembled into an abstraction in the AI, with a single representation.
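To make the proposed setup concrete, here's a hypothetical sketch of the kind of fine-tuning data I have in mind; the properties, phrasing, and held-out pair are made up for illustration, not taken from the paper:

```python
# Hypothetical sketch: fine-tune on premise -> "blegg" -> conclusion statements
# for some spoke-pairs, then test whether the model fills in a pair it never
# saw linked directly.
train_pairs = [("round", "blue"), ("smooth", "round"), ("smooth", "bouncy")]
held_out_pair = ("round", "bouncy")  # never linked in the fine-tuning data

def make_example(premise: str, conclusion: str) -> dict:
    return {
        "prompt": f"This object is {premise}.",
        "completion": f" So it's a blegg, so it's {conclusion}.",
    }

train_set = [make_example(p, c) for p, c in train_pairs]
eval_prompt = f"This object is {held_out_pair[0]}. Is it {held_out_pair[1]}?"
```

If the held-out pair fills in, that would suggest "blegg" got assembled into a single hub-like representation rather than a set of pairwise associations.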
Overall I consider this work to be weak evidence in favour of multi-step reasoning being an issue, since the latter parts show that multi-step reasoning definitely can occur (just not when both facts are fine-tuned separately).