All of simeon_c's Comments + Replies

Cool, thanks. 
I see that you've edited your post. If you look at the ASL-3 Containment Measures section, I'd recommend editing away the "Yay" as well. 
This post is a pretty significant instance of goalpost-moving.

While my initial understanding was that autonomous replication would be a ceiling, this doc now makes it a floor.

So in other words, this paper proposes to keep navigating beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, an... (read more)

2Zach Stein-Perlman11d
Not well; almost all of pp. 2–9 or maybe 2–13 is relevant. But here are some bits:

I agree with this general intuition, thanks for sharing. 

I'd value descriptions of specific failures you'd expect from an LLM that has been RLHF-ed against "bad instrumental convergence" but where we fail, or a better sense of how you'd guess that would look for an LLM agent or a scaled GPT. 

2Charlie Steiner15d
I expect you'd get problems if you tried to fine-tune an LLM agent to be better at tasks by using end-to-end RL. If it wants to get good scores from humans, deceiving or manipulating the humans is a common strategy (see "holding the claw between the camera and the ball" from the original RLHF paper).

LLMs trained purely predictively are, relative to RL, very safe. I don't expect real-world problems from them. It's doing RL against real-world tasks that's the problem. RLHF can itself provide an RL signal based on solving real-world tasks. Doing RLHF that provides a reward signal on some real-world task that's harder to learn than deceiving/manipulating humans will provide the AI a lot of incentive to deceive/manipulate humans in the real world.
6tailcalled15d
LLMs/GPTs get their capabilities not through directly pursuing instrumental convergence, but through mimicking humans who hopefully have pursued instrumental convergence (the whole "stochastic parrot" insight), so it's unclear what "bad instrumental convergence" even looks like in LLMs/GPTs or what it means to erase it. The closest thing I can see to answer the question is that LLMs sort of function as search engines and you want to prevent bad actors from gaining an advantage with those search engines so you want to censor stuff that is mostly helpful for bad activities. They seem to have done quite well at that, so it seems basically feasible. Of course LLMs will still ordinarily empower bad actors just as they ordinarily empower everyone, so it's not a full solution. I don't consider this very significant though as I have a hard time imagining that stochastic parrots will be the full extent of AI forever.

> I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I see. I guess where we might disagree: IMO, a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think it's far from obvious that an AI company n

... (read more)

Thanks for the clarifications. 

> But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

1. I think we agree that "unless it's provably safe" is the best version of trying to get a policy slowdown. 
2. I believe there are many interventions that could help on the slowdown side, most of which are... (read more)

2HoldenKarnofsky2mo
Thanks for the response!

Re: your other interventions - I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don't want to get into the details of most of them, but will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise." (I think this feels fairly clear when looking at other technological breakthroughs and how much they would've been affected by differently timed product releases.)

So I guess first you condition on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind? 

8HoldenKarnofsky4mo
Noting that I don't think alignment being "solved" is a binary.  As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned "enough," even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.
1Noosphere894mo
I don't exactly condition on alignment being solved. I instead point to a very important difference between OpenAI/Anthropic's AI vs Deepmind's AI, and the biggest difference between the two is that OpenAI/Anthropic's AI has a lot less incentive to develop instrumental goals due to having way fewer steps between the input and output, and incentivizes constraining goals, compared to Deepmind which uses RL, which essentially requires instrumental goals/instrumental convergence to do anything.

This is an important observation by porby, which I'd lossily compress to "Instrumental goals/instrumental convergence is at best a debatable assumption for LLMs and non-RL AI, and may not be there at all for LLMs/non-RL AI."

And this matters, because the assumption of instrumental convergence/powerseeking underlies basically all of the pessimistic analyses on AI, and arguably a supermajority of why AI is fundamentally dangerous, because instrumental convergence/powerseeking is essentially why it's so difficult to gain AI safety. LLMs/non-RL AI probably bypass all of the AI safety concerns that aren't related to misuse or ethics, and this has massive implications. So massive, I covered them in their own post here: https://www.lesswrong.com/posts/8SpbjkJREzp2H4dBB/a-potentially-high-impact-differential-technological

One big implication is obvious: OpenAI and Anthropic are much safer companies to win the AI race, relative to Deepmind, because of the probably non-existent instrumental convergence/powerseeking issue. It also makes the initial alignment problem drastically easier, as it's a non-adversarial problem that doesn't need security mindset to make the LLM/non-RL AI alignment researcher plan work, as described here: https://openai.com/blog/our-approach-to-alignment-research

And thus it makes the whole problem easier, as we don't need to worry much about the first AI researcher's alignment, resulting in a stable foundation for their recursive/meta alignment plan. The fact t

Thanks for writing that up. 

I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". And this core consideration is also why I don't think that the "Successful, careful AI lab" intervention is right. 

Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of uncareful deployment, which pretty substantially increases the chances of extinction.

2HoldenKarnofsky4mo
Thanks for this comment - I get vibes along these lines from a lot of people but I don't think I understand the position, so I'm enthused to hear more about it.

> I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively".

"Standards and monitoring" is the main "decrease the race" path I see. It doesn't seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused on the conditional (i.e., "unless it's demonstrably safe") version. But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

> Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities and the chances of uncareful deployment, which pretty substantially increases the chances of extinction.

I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?
2Noosphere894mo
This is only true if we assume that there are little to no differences in which company takes the lead in AI, or in which types of AI are preferable, and I think this is wrong: there are fairly massive differences between OpenAI or Anthropic winning the race, compared to Deepmind winning the race to AGI.

Extremely excited to see this new funder. 
I'm pretty confident that we can indeed find a significant number of new donors for AI safety since the recent Overton window shift. 

Chatting with people with substantial networks, it seemed to me like a centralized non-profit fundraising effort could probably raise at least $10M. Happy to intro you to those people if relevant @habryka

And reducing the processing time is also very exciting. 

So thanks for launching this.

6habryka4mo
Intros would be great! Now that we've launched I've been planning to reach out to more potential funders, and I think we will very likely get more good applications than we have funding for. Feel free to send me a DM or send me an email at habryka@lightspeedgrants.org to coordinate.

Thanks for writing this.

Overall, I don't like the post much in its current form. There's ~0 evidence (e.g. from Chinese newspapers) and there is very little actual argumentation. I like that you give us a local view, but a few links to back your claims would be very much appreciated. Right now it's hard to update on your post, given that the claims are very empirical and come without any external sources.

A more minor point: I also disagree with the statement "A domestic regulation framework for nuclear power is not a strong signal for a willingness to engage in nuclear arms reduction". I think it's definitely a signal.

@beren in this post, we find that our method (Causal Direction Extraction) captures a lot of the gender difference in 2 dimensions, in a linearly separable way. Skimming that post might be of interest to you and your hypothesis. 

In the same post, though, we suggest that it's unclear how much the logit lens "works": the direction that best encodes a given concept likely changes by a small angle at each layer, which causes two directions that best encode the same concept 15 layers apart to have a cosine similarity <0.5... (read more)
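For intuition, here is a minimal numpy sketch of how a small per-layer rotation of a concept direction compounds into a low cosine similarity across 15 layers (the 5° per-layer drift is an illustrative assumption, not a measured value from the post):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Assumed, illustrative drift: the concept direction rotates ~5 degrees per layer.
theta = np.deg2rad(5.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

direction = np.array([1.0, 0.0])   # direction best encoding the concept at some layer
drifted = direction.copy()
for _ in range(15):                # compound the small per-layer rotation over 15 layers
    drifted = rotation @ drifted

print(cosine_similarity(direction, drifted))  # ~0.26, i.e. well below 0.5
```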

I'd add that it's not an argument for making models agentic in the wild. It's just an argument for being worried already.

Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.

Over the last 2 months, I've spent more than 20 hours with David, talking and interacting with his ideas and plans, especially in technical contexts. 
As I spent more time with David, I got extremely impressed by the breadth and depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions on his agenda, which suggests that he has pre-computed a lot of things regarding his agenda (even though it sometimes looks very weird on ... (read more)

I'll focus on 2 first, given that it's the most important. 2. I would expect sim2real not to be too hard for foundation models because they're trained over massive distributions, which allow and force them to generalize to near neighbours. E.g. I think that it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans (which is why I'm mentioning a simulation here). Finally, regarding long horizon, I agree that it s... (read more)

Yes, I definitely think that countries with strong deontological norms will try harder to solve some narrow versions of alignment than those that tolerate failures. 

I think that's quite reassuring, and it means it's reasonable to focus a lot on the US in our governance approaches.

I think it's misleading to state it that way. There were definitely dinners and discussions with people around the creation of OpenAI. 
https://timelines.issarice.com/wiki/Timeline_of_OpenAI 
Months before the creation of OpenAI, there was a discussion about starting OpenAI that included Chris Olah, Paul Christiano, and Dario Amodei: "Sam Altman sets up a dinner in Menlo Park, California to talk about starting an organization to do AI research. Attendees include Greg Brockman, Dario Amodei, Chris Olah, Paul Christiano, Ilya Sutskever, and E... (read more)

2Paul Crowley5mo
Thanks, that's useful. Sad to see no Eliezer, no Nate or anyone from MIRI or having a similar perspective though :(

Also, I think it's fine to have lower chances of being an excellent alignment researcher for that reason. What matters is having impact, not being an excellent alignment researcher. E.g. I don't go all-in on a technical career myself essentially for that reason, combined with the fact that I have other features that might allow me to go further into the impact tail in other relevant subareas. 

1Evan R. Murphy8mo
Thanks for clarifying that. I'm not very familiar with the IQ scores and testing, but it seems reasonable you could get rough estimates that way. Good point, there are lots of ways to contribute to reducing AI risk besides just doing technical alignment research.

If I try to think about someone's IQ (which I don't normally do, except for the sake of the message above, where I tried to think about a specific number to make my claim precise), I feel like I can have an ordering where I'm not too uncertain, on a scale that includes me, some common reference classes (e.g. the median student of school X has IQ Y), and a few people around me who did IQ tests. By the way, I'd be happy to bet on anyone (e.g. from the list of SERI MATS mentors) who agreed to reveal their IQ, if you think my claim is wrong. 

2simeon_c8mo
Also, I think it's fine to have lower chances of being an excellent alignment researcher for that reason. What matters is having impact, not being an excellent alignment researcher. E.g. I don't go all-in on a technical career myself essentially for that reason, combined with the fact that I have other features that might allow me to go further into the impact tail in other relevant subareas. 

Thanks for writing that. 

Three thoughts that come to mind: 

  • I feel like a more accurate claim is something like "beyond a certain IQ, we don't know what makes a good alignment researcher", which I think is a substantially weaker claim than the one underlying your post. I also think that the fact that the probability of being a good alignment researcher increases with IQ is relevant if true (and I think it's very likely to be true, as for most sciences, where Nobel laureates are usually outliers along that axis). 
  • I also feel like I would expect pr
... (read more)
6Evan R. Murphy8mo
How do you (presume to) know people's IQ scores?

I think that yes, it is reasonable to say that GPT-3 is obsolete. 
Also, you mentioned loads of AGI startups being created in 2023, while that already happened a lot in 2022. How many more AGI startups do you expect in 2023? 

> But I don't expect these kinds of understanding to transfer well to understanding Transformers in general, so I'm not sure it's high priority.

The point is not necessarily to improve our understanding of Transformers in general, but that if we're pessimistic about interpretability on dense transformers (like markets are, see below), we might be better off speeding up capabilities on architectures we think are a lot more interpretable.

3Fabien Roger9mo
I'm not saying that MoE are more interpretable in general. I'm saying that for some tasks, the high level view of "which expert is active when and where" may be enough to get a good sense of what is going on. In particular, I'm almost as pessimistic about finding "search", or "reward functions", or "world models", or "the idea of lying to a human for instrumental reasons" in MoEs as in regular Transformers.

The intuition behind that is that MoE is about as useful when you want to do interp as the fact that there are multiple attention heads per attention layer doing "different discrete things" (though they do things in parallel). The fact that there are multiple heads helps you a bit, but not that much. This is why I care about transferability of what you learn when it comes to MoEs.

Maybe MoE + sth else could add some safeguards though (in particular, it might be easier to do targeted ablations on MoE than on regular Transformers), but I would be surprised if any safety benefit came from "interp on MoE goes brr". 
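As a concrete illustration of the "which expert is active when and where" view, here is a minimal toy sketch (the layer sizes, names, and top-1 routing are assumptions for illustration, not any production MoE architecture):

```python
import torch
import torch.nn as nn

class ToyTop1MoE(nn.Module):
    """Toy mixture-of-experts layer with top-1 routing, exposing which expert fires per token."""

    def __init__(self, d_model: int = 16, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x):                                    # x: (seq_len, d_model)
        expert_idx = self.router(x).argmax(dim=-1)           # which expert is active for each token
        out = torch.stack([self.experts[int(i)](tok) for tok, i in zip(x, expert_idx)])
        return out, expert_idx                               # return the routing so it can be inspected

moe = ToyTop1MoE()
tokens = torch.randn(8, 16)                                  # 8 dummy token embeddings
_, routing = moe(tokens)
print(routing.tolist())                                      # e.g. [2, 0, 0, 3, 1, 2, 2, 0]
```

The point is just that the per-token expert indices are a cheap, high-level interpretability signal you get for free from the routing, without opening up the experts themselves.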

The idea that EVERY government is dumb and won't figure out a not-too-bad way to allocate its resources toward AGI seems highly unlikely to me. There seem to be many mechanisms by which this could fail to be the case (e.g. national defense is highly involved and is a bit more competent, the strategy is designed in collaboration with some competent people from the private sector, etc.). 

To be more precise, I'd be surprised if none of these 7 countries had an ambitious plan that meaningfully changed the strategic landscape post-2030: 

  • US 
  • Israel 
  • UK
  • Singapore
  • France
  • China 
  • Germany

I guess I'm a bit less optimistic on the ability of governments to allocate funds efficiently, but I'm not very confident in that. 

A fairly dumb-but-efficient strategy that I'd expect some governments to take is "give more money to SOTA orgs" or "give some core roles to SOTA orgs in your Manhattan Project". That seems likely to me and that would have substantial effects. 

2Donald Hobson9mo
They may well have some results. Dumping money on SOTA orgs just bumps compute a little higher (and maybe data, if you are hiring lots of people to make data). It isn't clear why SOTA orgs would want to be in a government Manhattan project. It also isn't clear if any modern government retains the competence to run one. I don't expect governments to do either of these. You generated those strategies by sampling "dumb but effective" strategies. I tried to sample from "most of the discussion got massively sidetracked into the same old political squabbles and distractions." 

> Unfortunately, good compute governance takes time. E.g., if we want to implement hardware-based safety mechanisms, we first have to develop them, convince governments to implement them, and then they have to be put on the latest chips, which take several years to dominate compute. 

This is a very interesting point. 

I think that some "good compute governance", such as monitoring big training runs, doesn't require on-chip mechanisms, but I agree that for any measure that would involve substantial hardware modifications, it would probably take a lot of ... (read more)

What I'm confident in is that they're more likely to be ahead by then than they are now or within a couple of years. As I said, otherwise my confidence is ~35% that China catches up (or becomes better) by 2035, which is not huge? 

My reasoning is that they've been better than the US at optimizing ~everything, mostly because of their centralization and norms (not caring too much about human rights helps with optimizing), which is why I think it's likely that they'll catch up. 

Mostly because they have a lot of resources and thus can weigh a lot in the race once they enter it. 

2Donald Hobson9mo
Sure governments have a lot of resources. What they lack is the smarts to effectively turn those resources into anything. So maybe some people in government think AI is a thing, others think it's still mostly hype. The government crafts a bill. Half the money goes to artists put out of work by stable diffusion. A big section details insurance liability regulations for self driving cars. Some more funding is sent to various universities. A committee is formed. This doesn't change the strategic picture much.

Thanks for your comment! 

I see your point about fear spreading causing governments to regulate. I basically agree that if that's what happens, it's good to be in a position to shape the regulation in a positive way, or at least to try. I'm still more optimistic about corporate governance, which seems more tractable to me than policy governance. 

The points you make are good, especially in the second paragraph. My model is that if scale is all you need, then it's likely that indeed smaller startups are also worrying. I also think that there could be visible events in the future that would make some of these startups very serious contenders (happy to DM about that). 

Having a clear map of who works on corporate governance and who works more towards policy would be very helpful. Is there anything like a "map/post of who does what in AI governance"? 

3Koen.Holtman9mo
Thanks! I am not aware of any good map of the governance field. What I notice is that EA, at least the blogging part of EA, tends to have a preference for talking directly to (people in) corporations when it comes to the topic of corporate governance. As far as I can see, FLI is the AI x-risk organisation most actively involved in talking to governments. But there are also a bunch of non-EA related governance orgs and think tanks talking about AI x-risk to governments. When it comes to a broader spectrum of AI risks, not just x-risk, there are a whole bunch of civil society organisations talking to governments about it, many of them with ties to, or an intellectual outlook based on, Internet and Digital civil rights activism.

Have you read note 2? If note 2 were made more visible, would you still think that my claims imply too high a certainty? 

2konstantin9mo
I didn't read it, this clarifies a lot! I'd recommend making it more visible, e.g., putting it at the very top of the post as a disclaimer. Until then, I think the post implies unreasonable confidence, even if you didn't intend to.

To be honest, I hesitated to decrease the likelihood on that one based on your consideration, but I still think that 30% for strong effects is quite a lot because, as you mentioned, it requires the intersection of many conditions. 

In particular, you don't mention which interventions you expect from them. If you take the intervention I used as a reference class ("Constrain labs to airgap and box their SOTA models while they train them"), do you think there are things that are as "extreme" as this or more, and that are likely?

What might ... (read more)

3Koen.Holtman9mo
I think you are ignoring the connection between corporate governance and national/supra-national government policies. Typically, corporations do not implement costly self-governance and risk management mechanisms just because some risk management activists have asked them nicely. They implement them if and when some powerful state requires them to implement them, requires this as a condition for market access or for avoiding fines and jail-time.

Asking nicely may work for well-funded research labs who do not need to show any profitability, and even in that special case one can have doubts about how long their do-not-need-to-be-profitable status will last. But definitely, asking nicely will not work for your average early-stage AI startup. The current startup ecosystem encourages the creation of companies that behave irresponsibly by cutting corners. I am less confident than you are that Deepmind and OpenAI have a major lead over these and future startups, to the point where we don't even need to worry about them.

It is my assessment that, definitely in EA and x-risk circles, too few people are focussed on national government policy as a means to improve corporate governance among the less responsible corporations. In the case of EA, one might hope that recent events will trigger some kind of update.

Thanks for your comment! 

First, keep in mind that when people talk about "AI" in industry and policymaking, they usually have mostly non-deep-learning or vision deep learning techniques in mind, simply because they mostly don't know the academic ML field but have heard that "AI" was becoming important in industry. So this sentence is little evidence that Russia (or any other country) is trying to build AGI, and I'm at ~60% that Putin wasn't thinking about AGI when he said that. 

If anyone who could play any role at all in develop

... (read more)
2Karl von Wendt9mo
As you point out yourself, what makes people interested in developing AGI is progress in AI, not the public discussion of potential dangers. "Nobody cared about" LLMs is certainly not true - I'm pretty sure the relevant people watched them closely. That many people aren't concerned about AGI or are doubting its feasibility by now only means that THOSE people will not pursue it, and any public discussion will probably not change their minds. There are others who think very differently, like the people at OpenAI, Deepmind, Google, and (I suspect) a lot of others who communicate less openly about what they do.

I don't think you can easily separate the scientific community from the general public. Even scientific papers are read by journalists, who often publish about them in a simplified or distorted way. Already there are many alarming posts and articles out there, as well as books like Stuart Russell's "Human Compatible" (which I think is very good and helpful), so keeping the lid on the possibility of AGI and its profound impacts is way too late (it was probably too late already when Arthur C. Clarke wrote "2001 - A Space Odyssey"). Not talking about the dangers of uncontrollable AI for fear that this may lead to certain actors investing even more heavily in the field is both naive and counterproductive in my view.

I will definitely publish it, but I doubt very much that it will have a large impact. There are many other writers out there with a much larger audience who write similar books. I'm currently in the process of translating it to English so I can do just that. I'll send you a link as soon as I'm finished. I'll also invite everyone else in the AI safety community (I'm probably going to post an invite on LessWrong).

Concerning the Putin quote, I don't think that Russia is at the forefront of development, but China certainly is. Xi has said similar things in public, and I doubt very much that we know how much they currently spend on training their AIs. The qu

[Cross-posting my answer]
Thanks for your comment! 
That's an important point that you're bringing up. 

My sense is that at the movement level, the consideration you bring up is super important. Indeed, even though I have fairly short timelines, I would like funders to hedge for long timelines (e.g. fund AI safety work in China). Thus I think that big actors should have their full distribution in mind to optimize their resource allocation. 

That said, despite that, I have two disagreements: 

  1. I feel like at the individual level (i.e.
... (read more)

To get a better sense of people's standards on "cut at the hard core of alignment", I'd be curious to hear examples of work that has done so.

It would be worth paying someone to do this in a centralized way:

  1. Reach out to authors
  2. Convert to LaTeX, edit
  3. Publish

If someone is interested in doing this, reach out to me (campos.simeon @gmail.com)

Do you think we could use grokking/currently existing generalization phenomena (e.g. induction heads) to test your theory? Or do you expect the generalizations that would lead to the sharp left turn to be greater/more significant than those that occurred earlier in training? 

Thanks for trying! I don't think that's much evidence against GPT-3 being a good oracle though, because to me it's pretty normal that without fine-tuning it's not able to forecast. It'd need to be extremely sample efficient to be able to do that. Does anyone want to try fine-tuning?


Cost: You get basically 3 months free with GPT-3 Davinci (175B) (under a given limit, but one which is sufficient for personal use), and then you pay as you go. Even if you use it a lot, you're likely to pay less than $5 or $10 per month. 
And if you have some tasks that need a lot of tokens but are not too hard (e.g. hard reading comprehension), Curie (GPT-3 6B) is often enough and is much cheaper to use!

In few-shot settings (i.e. a setting in which you show examples of something so that it reproduces the pattern), Curie is often very good, so it's worth trying it... (read more)
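A minimal sketch of what such a few-shot call looks like, assuming the legacy (pre-1.0) openai Python SDK; the model choice and the sentiment task are just illustrative:

```python
import openai

openai.api_key = "sk-..."  # your API key

# Few-shot prompt: two worked examples, then the case we want labeled.
few_shot_prompt = (
    "Review: The movie was fantastic. Sentiment: positive\n"
    "Review: I wasted two hours of my life. Sentiment: negative\n"
    "Review: The plot was thin but the acting saved it. Sentiment:"
)

response = openai.Completion.create(
    engine="curie",        # much cheaper than davinci, often enough for few-shot tasks
    prompt=few_shot_prompt,
    max_tokens=1,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```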

Thanks for the feedback! I will think about it and maybe try to do something along those lines!

Are there existing models for which we're pretty sure we know all their latent knowledge? For instance, small language models or something like that.

1Ajeya Cotra2y
[Paul/Mark can correct me here] I would say no for any small-but-interesting neural network (like small language models); for something like linear regressions where we've set the features, it's kind of a philosophical question (though I'd say yes). In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones who can talk, including about counterfactuals/hypotheticals, as discussed in this appendix). This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

Thanks for the answer! The post you mentioned indeed is quite similar!

Technically, the strategies I suggested in my two last paragraphs (leverage the fact that we're able to verify solutions to problems we can't solve + give partial information to an algorithm and use more information to verify) should make it possible to go far beyond human intelligence/human knowledge using a lot of different narrowly accurate algorithms. 

And thus if the predictor has seen many extremely (narrowly) smart algorithms, it would be much more likely to know what it is like to be... (read more)
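For intuition on "verify solutions to problems we can't (efficiently) solve", here is a toy sketch using subset-sum as a stand-in hard problem; the example is my own illustration, not from the original thread:

```python
from itertools import combinations

def verify(numbers, target, candidate):
    """Cheap check: is `candidate` a subset of `numbers` summing to `target`?"""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(candidate) == target

def solve(numbers, target):
    """Brute-force solver: exponential in len(numbers)."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums, target = [3, 9, 8, 4, 5, 7], 15
proposal = solve(nums, target)                    # expensive to find...
print(proposal, verify(nums, target, proposal))   # ...but cheap to check: [8, 7] True
```

The asymmetry is the point: even when finding the answer is out of reach, checking a proposed answer stays cheap, which is what lets verification scale past the solver's own ability.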

2Ajeya Cotra2y
I think this is roughly right, but to try to be more precise, I'd say the counterexample is this:
* Consider the Bayes net that represents the upper bound of all the understanding of the world you could extract doing all the tricks described (P vs NP, generalizing from less smart to more smart humans, etc).
* Imagine that the AI does inference in that Bayes net.
* However, the predictor's Bayes net (which was created by a different process) still has latent knowledge that this Bayes net lacks.
* By conjecture, we could not have possibly constructed a training data point that distinguished between doing inference on the upper-bound Bayes net and doing direct translation.

You said that naive questions were tolerated, so here's a scenario for which I can't figure out why it wouldn't work.

It seems to me that the fact that an AI fails to predict the truth (because it predicts as humans would) is due to the AI having built an internal model of how humans understand things, and predicting based on that understanding. So if we assume that an AI is able to build such an internal model, why wouldn't we train an AI to predict what a (benevolent) human would say given an amount of information and a capacity to process information? Doing... (read more)

1Ajeya Cotra2y
This proposal has some resemblance to turning reflection up to 11, and the key question you raise is the source of the counterexample in the worst case. Because ARC is living in "worst-case" land, they discard a training strategy once they can think of any at-all-plausible situation in which it fails, and move on to trying other strategies. In this case, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows. This is closely related to the counterexample "Gradient descent is more efficient than science" given in the report.

I think that "There are many talented people who want to work on AI alignment, but are doing something else instead." is likely to be true. I met at least 2 talented people who tried to get into AI Safety but who weren't able to because open positions / internships were too scarce. One of them at least tried hard (i.e applied for many positions and couldn't find one (scarcity), despite the fact that he was one of the top french students in ML). If there was money / positions, I think that there are chances that he would work on AI alignment independently.
Connor Leahy in one of his podcasts mentions something similar aswell.

That's the impression I have.

5adamShimi2y
I want to point out that cashing out "talented" might be tricky. My observation is that talent for technical alignment work is not implied/caused by talent in maths and/or ML. It's not bad to have any of this, but I can think of many incredible people in maths/ML I know who seem way less promising to me than some person with the right mindset and approach.