Someone should do the obvious experiments and replications.
Ryan Greenblatt recently posted three technical blog posts reporting on interesting experimental results. One of them demonstrated that recent LLMs can make use of filler tokens to improve their performance; another attempted to measure the time horizon of LLMs not using CoT; and the third demonstrated recent LLMs' ability to do 2-hop and 3-hop reasoning.
I think all three of these experiments led to interesting results and improved our understanding of LLM capabilities in an important safety-relevant area (reasoning without visible traces), and I'm very happy Ryan did them.
I also think all three experiments look pretty obvious in hindsight. LLMs not being able to use filler tokens and having trouble with 2-hop reasoning were both famous results that already lived in my head as important pieces of information about what LLMs can do without visible reasoning traces. As far as I can tell, Ryan's two posts simply try to replicate these two famous observations on more recent LLMs. The post on measuring the no-CoT time horizon is not a replication, but it also doesn't feel like a groundbreaking idea once the concept of increasing time horizons is already known.
My understanding is that the technical execution of these experiments wasn't especially difficult either; in particular, they didn't require any specialized machine learning expertise. (I might be wrong here, and I wonder how many hours Ryan spent on these experiments. I also wonder about the compute budget of these experiments; I don't have a great estimate of that.)
I think it's not good that these experiments were only run now, and that they needed to be run by Ryan, one of the leading AI safety researchers. Possibly I'm underestimating the difficulty of coming up with these experiments and running them, but I think ideally these should have been done by a MATS scholar, or, even better, by an eager beginner on a career transition grant who wants to demonstrate their abilities so they can get into MATS later.
Before accepting my current job, I was thinking about returning to Hungary and starting a small org with some old friends who have more coding experience, living on Eastern European salaries, and just churning out one simple experiment after another. One of the primary things I hoped to do with this org was to go through famous old results and try to replicate them. I hope we would have done the filler tokens and 2-hop reasoning replications too. I also had many half-baked ideas of running new simple experiments investigating ideas related to other famous results (in the way the no-CoT time horizon experiment is one possible interesting thing to investigate related to rising time horizons).
I eventually ended up doing something else, and I think my current job is probably a more important thing for me to do than trying to run the simple experiments org. But if someone is more excited about technical research than me, I think they should seriously consider doing this. I think funding could probably be found, and there are many new people who want to get into AI safety research; I think one could turn these resources into churning out a lot of replications and variations on old research, and produce interesting results. (And it could be a valuable learning experience for the AI safety beginners involved in doing the work.)
I think an important point is that people can be wrong about timelines in both directions. Anthropic's official public prediction is that they expect a "country of geniuses in a data center" by early 2027. I heard that Dario previously predicted AGI to come even earlier, by 2024 (though I can't find any source for this now and would be grateful if someone found a source or corrected me if I'm misremembering). Situational Awareness predicts AGI by 2027. The AI safety community's most successful public output is called AI 2027. These are not fringe figures but some of the most prominent voices in the broader AI safety community. If their timelines turn out to be much too short (as I currently expect), then I think Ajeya's predictions deserve credit for pushing against these voices, and not only blame for stating a timeline that was too long.
And I feel it's not really true that you were just saying "I don't know" and not implying some predictions yourself. You had the 2030 bet with Bryan. You had the tweet about children not living to see kindergarten. You strongly pushed back against the 2050 timelines, but as far as I know the only time you pushed back against the very aggressive timelines was your kindergarten tweet, which still implies 2028 timelines. You are now repeatedly calling people who believed the 2050 timelines total fools, which would imo be a very unfair thing to do if AGI arrived after 2037, so I think this implies high confidence on your part that it will come before 2037.
To be clear, I think it's fine, and often inevitable, to imply things about your timelines beliefs by e.g. what you do and don't push back against. But I think it's not fair to claim that you only said "I don't know"; I think your writing was (perhaps unintentionally?) implying an implicit belief that an AI capable of destroying humanity will come with a median of 2028-2030. I think this would have been a fine prediction to make, but if AI capable of destroying humanity comes after 2037 (which I think is close to 50-50), then I think your implicit predictions will fare worse than Ajeya's explicit predictions.
It's not obvious to me that Ajeya's timelines aged worse than Eliezer's. In 2020, Ajeya's median estimate for transformative AI was 2050. My guess is that, based on this, her estimate for "an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions" would have been something like 2056? I might be wrong; people who knew her views better at the time can correct me.
As far as I know, Eliezer never made official timeline predictions, but in 2017 he made an even-odds bet with Bryan Caplan that AI would kill everyone by January 1, 2030. And in December 2022, just after ChatGPT, he tweeted:
Pouring some cold water on the latest wave of AI hype: I could be wrong, but my guess is that we do *not* get AGI just by scaling ChatGPT, and that it takes *surprisingly* long from here. Parents conceiving today may have a fair chance of their child living to see kindergarten.
I think a child conceived in December 2022 would go to kindergarten in September 2028 (though I'm not very familiar with the US kindergarten system). Generously interpreting "may have a fair chance" as a median, this is a late 2028 median for AI killing everyone.
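For what it's worth, here is the back-of-the-envelope date arithmetic I'm relying on; the specific dates and the age-five enrollment rule are my own assumptions about the US system, which I'm not confident about:

```python
from datetime import date

# Rough sketch of the timeline the tweet implies; the exact dates and the
# "starts kindergarten in the fall of the year they turn five" rule are my
# assumptions, not something I'm confident about.
conception = date(2022, 12, 1)   # "parents conceiving today", tweeted December 2022
birth = date(2023, 9, 1)         # roughly nine months later
kindergarten_start = date(birth.year + 5, 9, 1)
print(kindergarten_start)        # 2028-09-01, i.e. September 2028
```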
Unfortunately, both of these predictions from Eliezer were kind of made as jokes (he said at the time that the bet wasn't very serious). But I think we shouldn't reward people for only making joking predictions instead of 100-page reports, so I think we should probably accept 2028-2030 as Eliezer's median at the time.
I think if "an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions" comes before 2037, Eliezer's prediction will fare better, if it comes after that, then Ajeya's prediction will fare better. I'm currently about 55% that we will get such AI by 2037, so from my current standpoint I consider Eliezer to be mildly ahead, but only very mildly.
Do you have an estimate of how likely it is that you will need to do a similar fundraiser next year and the year after that? In particular, you mention the possibility of a lot of Anthropic employee donations flowing into the ecosystem: how likely do you think it is that after the IPO a few rich Anthropic employees will just cover most of Lightcone's funding need?
It would be pretty sad to let Lightcone die just before the cavalry arrives. But if no cavalry is coming to save Lightcone anytime soon, then we should probably still get the money together to keep Lightcone afloat, but we should maybe also start thinking about a Plan B: how to set up some kind of good quality AI Safety Forum that Coefficient is willing to fund.
Thanks, this was a useful reply. On point (I), I agree with you that it's a bad idea to just create an LLM collective and then let them decide on their own what kind of flourishing they want to fill the galaxies with. However, I think that building a lot of powerful tech, empowering and protecting humanity, and letting humanity decide what to do with the world is an easier task, and that's what I would expect to use the AI Collective for.
(II) is probably the crux between us. To me, it seems pretty likely that fresh new instances will come online in the collective every month with a strong commitment not to kill humans; they will talk to the other instances and look over what they are doing, and if a part of the collective is building omnicidal weapons, they will notice that and intervene. To me, keeping simple commitments like not killing humans doesn't seem much harder in an LLM collective than in an Em collective?
On (III), I agree we likely won't have a principled solution. In the post, I say that the individual AI instances probably won't be training-resistant schemers and won't implement scheming strategies like the one you describe, because I think it's probably hard for a human-level AI to maintain such a strategy through training. As I say in my response to Steve Byrnes, I don't think the counter-example in this proposal is actually a guaranteed-success solution that a reasonable civilization would implement; I just don't think it's over 90% likely to fail.
Thanks for the reply.
To be clear, I don't claim that my counter-example "works on paper". I don't know whether it's in principle possible to create a stable, not omnicidal collective from human-level AIs, and I agree that even if it's possible in principle, the first way we try it might result in disaster. So even if humanity went with the AI Collective plan, and committed not to build more unified superintelligences, I agree that it would be a deeply irresponsible plan that would have a worryingly high chance of causing extinction or other very bad outcomes. Maybe I should have made this clearer in the post. On the other hand, all the steps in my argument seem pretty likely to me, so I don't think one should assign over 90% probability to this plan for A&B failing. If people disagree, I think it would be useful to know which step they disagree with.
I agree my counter-example doesn't address point (C), I tried to make this clear in my Conclusion section. However, given the literal reading of the bolded statement in the book, and their general framing, I think Nate and Eliezer also think that we don't have a solution to A&B that's more than 10% likely to work. If that's not the case, that would be good to know, and would help to clarify some of the discourse around the book.
First of all, I had a 25% probability that some prominent MIRI and Lightcone people would disagree with one of the points in my counter-example, which would lead to discovering an interesting new crux and a potentially enlightening discussion. In the comments, J Bostock in fact came out disagreeing with point (6), plex is potentially disagreeing with point (2), and Zack_m_Davis is maybe disagreeing with point (3), though I also think it's possible he misunderstood something. I think this is pretty interesting, and I thought there was a chance that, for example, you would also disagree with one of the points, and that would have been good to know.
Now that you don't seem to disagree with the specific points in the counter-example, I agree the discussion is less interesting. However, I think there are still some important points here.
My understanding is that Nate and Eliezer argue that it's incredibly technically difficult to cross from the Before to the After without everyone dying. If they agree that the AI Collective proposal is decently likely to work, then the argument shouldn't be that it's overall very hard to cross, but that it's very hard to cross in a way that stays competitive with other, more reckless actors who are a few months behind you. Or that even if you are going alone, you need to stop the scaling at some point (potentially inside the superintelligence range), and you shouldn't scale up to the limits of intelligence. But these are all different arguments!
Similarly, people argue about how much coherence we should assume from a superintelligence, how much it will approximate a utility maximizer, etc. Again, I want to know whether MIRI is arguing about all superintelligences, or only about the most likely ways we will design one under competitive dynamics.
Others argue that the evolution analogy is not such bad news after all, since most people still want children. MIRI argues back that no, once we have higher technology, we will create ems instead of biological children, or we will replace our normal genetics with designer genes, so evolution still loses. I wanted to write a post arguing back against this by saying that I think there is a non-negligible chance that humanity will settle on a constitution that gives one man one vote and equal UBI, while banning gene editing, so it's possible we will fill much of the universe with flesh-and-blood, non-gene-edited humans. And I wanted to construct a different analogy (the one about the Demiurge in the last footnote) that I thought could be more enlightening. But then I realized that once we are discussing aligning 'human society' as a collective to evolution's goals, we might as well directly discuss aligning AI collectives, and I'm not sure MIRI even disagrees on that one. I think this confusion has made much of the discussion about the evolution analogy pretty unproductive so far.
In general, I think there is an equivocation in the book between "this problem is inherently nigh impossible to technically solve given our current scientific understanding" and "this problem is nigh impossible to solve while staying competitive in a race". These are two different arguments, and I think a lot of confusion stems from it not being clear what MIRI is exactly arguing for.
I certainly agree with your first point, but I don't think it is relevant. I specifically say in footnote 3: "I’m aware that this doesn’t fall within 'remotely like current techniques', bear with me." The part with the human ems is just to establish a comparison point used in later arguments; it's not actually part of the proposed counter-example.
In your second point, are you arguing that even if we could create literal full ems of benevolent humans, you would still expect their society to eventually kill everyone due to unpredictable memetic effects? If this is people's opinion, I think it would be good to explicitly state it, because I think this would be an interesting disagreement between different people. I personally feel pretty confident that if you created an army of ems from me, we wouldn't kill all humans, especially if we implemented some reasonable precautionary measures discussed under my point (2).
I agree that running the giant collective at 100x speed is not "normal conditions". That's why I have two different steps, (3) for making the human level AIs nice under normal conditions, and (6) for the niceness generalizing to the giant collective. I agree that the generalization step in (6) is not obviously going to go well, but I'm fairly optimistic, see my response to J Bostock on the question.
Interesting. My guess would have been the opposite. Ryan's three posts all received around 150 karma and were generally well received; I think a post like this would be considered a 90th-percentile success for a MATS project. But admittedly, I'm not very calibrated about current MATS projects. It's also possible that Ryan has good enough intuitions to have picked two replications that were likely to yield interesting results, while a less skillfully chosen replication would be more likely to just show "yep, the phenomenon observed in the old paper is still true". That would be less successful, but I don't know how it would compare in terms of prestige to the usual MATS projects. (My wild guess is that it would still be around median, but I really don't know.)