All of Zac Hatfield-Dodds's Comments + Replies

The Precipice has a good summary of reasons to believe that human-created risks are much more likely than naturally occurring risks like solar flares or asteroid or cometary impacts. If you'd like to read the book, which covers existential risks including from AI in more detail, I'm happy to buy you a copy. Specific to AI, Russell's Human Compatible and Christian's The Alignment Problem are both good too.

More generally it sounds like you're missing the ideas of the orthogonality thesis and convergent instrumental goals.

1 Program Den 15d
It might be fun to pair Humankind: A Hopeful History with The Precipice, as both have been suggested reading recently. It seems to me that we are, as individuals, getting more and more powerful. So this question of "alignment" is a quite important one - as much for humanity, with the power it currently has, as for these hypothetical hyper-intelligent AIs. Looking at it through a Sci-Fi AI lens seems limiting, and I still haven't really found anything more than "the future could go very very badly", which is always a given, I think. I've read those papers you linked (thanks!). They seem to make some assumptions about the nature of intelligence, and rationality - indeed, the nature of reality itself. (Perhaps the "reality" angle is a bit much for most heads, but the more we learn, the more we learn we need to learn, as it were. Or at least it seems thus to me. What is "real"? But I digress.) I like the idea of Berserkers (Saberhagen) better than run-amok Pi calculators… however, I can dig it. Self-replicating killer robots are scary. (Just finished Horizon: Zero Dawn - Forbidden West and I must say it was as fantastic as the previous installment!) Which of the AI books would you recommend I read if I'm interested in solutions? I've read a lot of stuff on this site about AI now (before I'd read mostly Sci-Fi or philosophy here, and I never had an account or interacted), most of it seems to be conceptual and basically rephrasing ideas I've been exposed to through existing works. (Maybe I should note that I'm a fan of Kurzweil's takes on these matters - takes which don't seem to be very popular as of late, if they ever were. For various reasons, I reckon. Fear sells.) I assume Precipice has some uplifting stuff at the end[1], but I'm interested in AI specifically ATM. What I mean is, I've seen a few proposals to "ensure" alignment, if you will, with what we have now (versus say warnings to keep in mind once we have AGI or are demonstrably close to i

I don't feel I can rule out slow/weird scenarios like those you describe, or where extinction is fast but comes considerably after disempowerment, or where industrial civilization is destroyed quickly but it's not worth mopping up immediately - "what happens after AI takes over" is by nature extremely difficult to predict. Very fast disasters are also plausible, of course.

I'm basing my impression here on having read much of Nate's public writing on AI, and a conversation over shared lunch at a conference a few months ago. His central estimate for is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, somewhat diverging about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.

  • Unfortunately I agree that "shut down" and "no catastrophe" are still missing pieces. I'm more opt...

I'm one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and well aware of those findings. I'm certainly not claiming that all is well; and agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I've seen alignment-ish properties generalize about as well as capabilities, and that I don't have a strong expectation that this will change in future.

I als...

The key relaxation here is: deceptive alignment will not happen. In many ways, a lot of hopes are resting on deceptive alignment not being a problem. I disagree, since I think the non-myopia found is a key way that something like goal misgeneralization or the sharp left turn could happen, where a model remains very capable but loses its alignment properties due to deceptive alignment.

That's not how I wrote the essay at all - and I don't mean to imply that the situation is good (I find 95% chance of human extinction in my lifetime credible! This is awful!). Hope is an attitude to the facts, not a claim about the facts; though "high confidence in [near-certain] doom is unjustified" sure is. But in the spirit of your fair question, here are some infelicitous points:

  • I find is less credible than (in the essay above). Conditional on no attempts at alignment, my inside view might be about - haven't tho...

This is statistically neat, but I'd recommend Taleb's Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications - the most extreme cases are in practice always those for which you assumed the wrong distribution! E.g. there are many cases where a system spends most of its time in a regime characterized by a normal distribution, and then, rarely, a different mechanism in the underlying dynamics shifts the whole thing wildly out of that - famously including the blowup of Long-Term Capital Management after just four years.
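To make the regime-switching point concrete, here is a minimal sketch (not taken from Taleb's book) comparing how often an "extreme" move happens under a single fitted normal versus under a two-regime mixture. The weights and scales (`p_jump`, `sigma_calm`, `sigma_jump`) are invented purely for illustration.

```python
import math

def two_sided_tail(k, sigma):
    """P(|X| > k) for X ~ Normal(0, sigma^2)."""
    return math.erfc(k / (sigma * math.sqrt(2)))

# Hypothetical numbers, chosen only for illustration.
p_jump = 0.001      # fraction of time spent in the rare regime
sigma_calm = 1.0    # scale of the usual, well-behaved regime
sigma_jump = 50.0   # scale once the other mechanism takes over

# Standard deviation of the mixture: roughly what fitting a single
# normal distribution to a long sample would estimate.
sigma_fit = math.sqrt((1 - p_jump) * sigma_calm**2 + p_jump * sigma_jump**2)

k = 10.0  # an "extreme" move: ten calm-regime standard deviations

# Tail probability if you (wrongly) assume one normal distribution.
tail_assumed = two_sided_tail(k, sigma_fit)

# Tail probability under the two-regime mixture that generated the data.
tail_actual = ((1 - p_jump) * two_sided_tail(k, sigma_calm)
               + p_jump * two_sided_tail(k, sigma_jump))

print(f"fitted sigma:           {sigma_fit:.2f}")
print(f"P(|X| > 10), one normal: {tail_assumed:.2e}")
print(f"P(|X| > 10), mixture:    {tail_actual:.2e}")
print(f"underestimate factor:    {tail_actual / tail_assumed:.0f}x")
```

With these made-up parameters the single-normal fit understates the frequency of large moves by several orders of magnitude, even though it describes almost all of the observations well.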

It does make sense, and there's a way to do it!

  1. We're going to use Solomonoff induction, or (if you want it to be computable) an approximation like AIXI-tl, so we'll need a prior over all Turing machines. Let's go with the speed prior for now.
  2. At each bit, choose the bit which has lower probability according to this predictor.

This sequence is entirely deterministic, but can't be predicted without self-reference.
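The two steps above can be sketched with any computable predictor standing in for the real thing. The code below is only an illustration of the diagonal construction: `laplace_predictor` is a toy substitute (Laplace's rule of succession), not Solomonoff induction, AIXI-tl, or a speed prior, and the tie-breaking rule is an arbitrary choice.

```python
def laplace_predictor(prefix):
    """Toy stand-in predictor: P(next bit = 1) via Laplace's rule of succession."""
    return (sum(prefix) + 1) / (len(prefix) + 2)

def contrarian_sequence(predictor, n_bits):
    """Build the sequence that always takes the bit the predictor finds less likely."""
    bits, guesses = [], []
    for _ in range(n_bits):
        p_one = predictor(bits)
        guess = 1 if p_one >= 0.5 else 0  # predictor's best guess (ties go to 1)
        guesses.append(guess)
        bits.append(1 - guess)            # step 2: emit the other bit
    return bits, guesses

bits, guesses = contrarian_sequence(laplace_predictor, 16)
hits = sum(b == g for b, g in zip(bits, guesses))

print(bits)
print(f"predictor accuracy on its own nemesis: {hits}/{len(bits)}")
```

The sequence is a pure function of the predictor, so rerunning it always yields the same bits, yet the predictor it was built from never guesses a single bit correctly.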

The Air Force/Texas Coronary Atherosclerosis Prevention Study exemplifies this perfectly in a cohort of 6605 patients. Before treatment both LDL-C and apoB are good predictors of major coronary events. After one year of treatment focused on LDL-C levels, their LDL-C levels stopped being predictive for a future major coronary event (P=.162), but their apoB levels were still a strong predictor (P=.001).

A few subtle but very important statistical nitpicks (full text of the study for reference):

  • Their sample size is large enough to get pretty reliable P-va...
You make some good points, but thinking about the fact that researchers should correct for multiple-hypothesis testing always makes me a little sad—this almost never happens. Do you have an example where a study does this really nicely? Also, do you have any input on the hypothesis that treating early is a worthwhile risk?
1 Lao Mein 2mo
Wait, I thought LDL and HDL levels were considered vital to predicting future heart disease. Why would LDL/HDL levels during treatment be so unrelated to patient outcomes? LDL only decreased 25%. Weren't statins used to prevent heart disease specifically because of their effects on LDL/HDL levels?

Fixed, thanks; it links to the transformer circuits thread, which includes the induction heads paper, SoLU, and Toy Models of Superposition.

"Okay, you have found that a one-layer transformer without MLP approximates skip-trigram statistics; how does it generalize to the question 'does GPT-6 want to kill us all?'?"

I understand this is more an illustration than a question, but I'll try answering it anyway because I think there's something informative about different perspectives on the problem :-)

Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago...

Nitpick: This link probably meant to go to the induction heads and in-context learning paper?

Anthropic is still looking for a senior software engineer

As last time, Anthropic would like to hire many people for research, engineering, business, and operations roles. More on that here, or feel free to ask me :-)

Happily, the world's experts regularly compile consensus reports for the UN Framework Convention on Climate Change, and the Sixth Assessment Report is currently being finalized. While the full report is many thousands of pages, the "summary for policymakers" is an easy - sometimes boring - read at tens of pages.

  1. The Physical Science Basis
  2. Impacts, Adaptation and Vulnerability
  3. Mitigation of Climate Change

I think you're asking about wg2 and w...

1 Augustin Portier 2mo
Thanks for your answer. I actually recently had an opportunity to skim through the Working Group 2’s full report, so I know there is some stuff there. I hadn’t thought of looking at the WG3’s, for some reason, even though it’s probably more useful for my question, so thanks for suggesting it.

I'm not surprised that if you investigate context-free grammars with two-to-six-layer transformers you learn something very much like a tree. I also don't expect this result to generalize to larger models or more complex tasks, and so personally I find the paper plausible but uninteresting.

I do think the paper adds onto the pile of "neural networks do learn a generalizing algorithm" results. Notably, on Geoquery (the non-context-free grammar task), the goal is still to predict a parse tree of the natural sentence: given that neural networks generalize, it's not surprising that a tree-like internal structure emerges on tasks that require a tree-like internal structure. Since we don't observe this in transformers trained on toy tasks without inherent tree-structure (e.g. Redwood's paren balancer) or on specific behaviors of medium LMs (induction, IOI, skip sequence counting), my guess is this is very much dependent on how "tree-like" the actual task is. My guess is that some parts of language modeling are indeed tree-like, so we'll see a bit of this, but it won't explain a large fraction of how the network's internals work. In terms of other evidence on medium-sized language models, there's also the speculative model from this post, which suggests that insofar as this tree structure is being built, it's being built in the earlier layers (1-8) and not later layers.

Please note that the inverse scaling prize is not from Anthropic:

The Inverse Scaling Prize is organized by a group of researchers on behalf of the Fund for Alignment Research (FAR), including Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Additionally, Sam Bowman and Ethan Perez are affiliated with Anthropic; Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman are affiliated with New York University. The prize pool is provided by the Future Fund.

A...

(if funding would get someone excited to do a great job of this, I'll help make that happen)

I'd be especially excited if this debate produced an adversarial-collaboration-style synthesis document, laying out the various perspectives and cruxes. I think that collapsing onto an optimism/pessimism binary loses a lot of important nuance; but also that HAIST reading, summarizing, and clearly communicating the range of views on RLHF could help people holding each of those views more clearly understand each other's concerns and communicate with each other.

2 Sam Marks 3mo
I agree that something like this would be excellent. I unfortunately doubt that anything so cool will come out of this experiment. (The most important constraint is finding a HAIST member willing to take on the project of writing something like this up.) If things go well, we are tentatively planning on sharing the list of core disagreements we identify (these will probably look like cruxes and subquestions) as well as maybe data about our members' distribution of views before and after the debate.

No, this is also easy to work around; language models are good at deobfuscation and you could probably even do it with edit-distance techniques. Nor do you have enough volume of discussion to hide from humans literally just reading all of it; nor is Facebook secure against state actors, nor is your computer secure. See also Security Mindset and Ordinary Paranoia.
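As a rough illustration of how little protection this kind of obfuscation offers, here is a minimal sketch that undoes ROT13 with the standard library and maps scrambled spellings back to dictionary words by edit distance. The scrambled tokens are taken from the comment being answered; the five-word `vocabulary` is a hypothetical stand-in for a real dictionary or watch-list, and nothing here is meant to describe what any actual adversary runs.

```python
import codecs

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (or match)
        prev = curr
    return prev[-1]

def nearest_word(token, vocabulary):
    """Return the vocabulary word with the smallest edit distance to token."""
    return min(vocabulary, key=lambda word: edit_distance(token, word))

# Hypothetical vocabulary; a real one would be a full dictionary.
vocabulary = ["searchable", "sensible", "words", "obfuscation", "codewords"]

# ROT13 is a fixed letter substitution, so decoding it is a single call.
print(codecs.decode("pbqrjbeqf", "rot13"))

# Shuffled or misspelled words sit only a few edits from the original.
for scrambled in ["sreachalbe", "ofbucsation", "sensicle", "wurds"]:
    print(scrambled, "->", nearest_word(scrambled, vocabulary))
```

A brute-force nearest-word search like this is slow on a large dictionary, but it needs no cleverness at all, which is the point: the scheme fails against even the least sophisticated tooling.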

Bluntly: if you write it on Lesswrong or the Alignment Forum, or send it to a particular known person, governments will get a copy if they care to. Cybersecurity against state actors is really, really, really hard. Lesswrong is not capable of state-level cyberdefense.

If you must write it at all: do so with hardware which has been rendered physically unable to connect to the internet, and distribute only on paper, discussing only in areas without microphones. Consider authoring only on paper in the first place. Note that physical compromise of your home, w...

When walls don't work, can use ofbucsation? I have no clue about this, but wouldn't it be much easier to use pbqrjbeqf for central wurds necessary for sensicle discussion so that it wouldn't be sreachalbe, and then have your talkings with people on fb or something? Would be easily found if written on same devices or accounts used for LW, but that sounds easier to work around than literally only using paper?
Yep, we are definitely not capable of state-level or even "determined individual" level of cyberdefense.

Anthropic is hiring senior software engineers and many other positions including a recruiter and recruiting coordinator (meta-hiring!). More about that here.

I wouldn't usually post those links, but the blue dot site seems to be broken and the comment here might give the impression that we're only hiring for (specific) technical roles.

4 Jamie Bernardi 3mo
Thanks for pointing that out! This looks like what I hope to have been a temporary breakage when you made the comment. Admittedly I'm not too sure what changes were being made at the time you made this comment, so I hope it's nothing too sinister. I'd appreciate it if you contact me with some more info about your browser and location, if it still appears to be broken on your end. (It looks fine to me right now.) Additionally, since I'm here, I aim to add more of Anthropic's roles in the next few days; their omission isn't due to policy, rather trying to prioritise the order in which I build up the board and make it navigable for users looking for roles most relevant to them.
4 Esben Kran 3mo
Thank you for pointing it out! I've reached out to the BlueDot team as well and maybe it's a platform-specific issue since it looks fine on my end.

For Python basics, I have to anti-recommend Shaw's 'learn the hard way'; it's generally outdated and in some places actively misleading. And why would you want to learn the hard way instead of the best way in any case?

Instead, my standard recommendation is Al Sweigart's Automate the Boring Stuff and then Beyond the Basic Stuff (both readable for free on, or purchasable in books); he's also written some books of exercises. If you prefer a more traditional textbook, Think Python 2e is excellent and also available freely online.

Strong agree with not using "the hard way." Not a big fan. Automate the Boring Stuff isn't bad. But I think the best, direct-path/deliberate practice resource for general Python capability is Pybites + Anki. There are 300-400 exercises (with a browser interpreter) that require you to solve problems without them handholding or instructing you (so it's like LeetCode but for actual programming capability, not just algorithm puzzles). Everything requires you to read documentation or Google effectively, and they're all decent, realistic use cases. There are also specific paths (like testing, decorators & context management, etc.) that are covered in the exercises. My procedure:

  • Finish at least the Easy and at least half the Mediums for each relevant path.
  • If you learned something new, put it into Anki (I put the whole Pybite into a card and make a cloze-completion for the relevant code; I then have to type it for reviews).
  • Finish the rest of the Pybites (which will be an unordered mix of topics that includes the remaining Mediums and Hards for each of the learning paths, plus miscellaneous).
  • IMO, you will now actually be a solid low-intermediate Python generalist programmer, though of course you will need to learn lots of library/specialty-specific stuff in your own area.
4 Neel Nanda 3mo
Thanks! I learned Python ~10 years ago and have no idea what sources are any good lol. I've edited the post with your recs :)
  • I think this framing is accurate and important. Implications are of course "undignified" to put it lightly...
  • Broadly agree on upshot (1), though of course I hope we can do even better. (2) is also important though IMO way too weak. (Rule zero: ensure that it's never your lab that ends the world)
  • As usual, opinions my own.

One answer is that it might be soon enough that it's very much like right now, and so any contemporary fiction could do. Another is that "the future is already here, just unevenly distributed", and extrapolation can give you credible stories (so long as they're not set too close to an AI lab).

On longer-timelines views, a prerequisite to writing such a novel would be accurately forecasting the trajectory of AI research and deployment right up to the end. This is generally seen as rather difficult.

Mu. Ask "how does this knowledge actually affect pandemic response", not what I'd do. I claim personal expertise in neither case.

(If I could set policy we'd have driven at minimal cost (Covid was eradicable, and eliminated at several times in several countries!), had generic-coronavirus vaccine bases and pre-authorized challenge trials if that was infeasible, etc. In the world we're actually in, my understanding is that research like this feeds in to vaccine design, however far from ideal vaccine policy and delivery might be.)

Then why do you claim knowledge about it having effects? There are already lots of different coronaviruses out there in the wild. If you create a generic-coronavirus vaccine that works for all of those in the wild, what do you expect to gain from knowing that it works for a tiny (relative to the existing pool of coronaviruses) portion of new lab-created viruses?

Thanks for an excellent reply! One possible crux is that I don't think that synthesized human values are particularly useful; I'd expect that AGI systems can do their own synthesis from a much wider range of evidence (including law, fiction, direct observation, etc.). As to the specific points, I'd respond:

  • There is no unified legal theory precise enough to be practically useful for AI understanding human preferences and values; liberal and social democracies alike tend to embed constraints in law, with individuals and communities pursuing their values...
4 John Nay 3mo
Thanks for the reply. 1. There does seem to be legal theory precise enough to be practically useful for AI understanding human preferences and values. To take just one example: the huge amount of legal theory on how to craft directives. For instance, whether to make directives in contracts and legislation more of a rule nature or a standards nature. Rules (e.g., “do not drive more than 60 miles per hour”) are more targeted directives than standards. If comprehensive enough for the complexity of their application, rules allow the rule-maker to have more clarity than standards over the outcomes that will be realized conditional on the specified states (and agents’ actions in those states, which are a function of any behavioral impact the rules might have had). Standards (e.g., “drive reasonably” for California highways) allow parties to contracts, judges, regulators, and citizens to develop shared understandings and adapt them to novel situations (i.e., to generalize expectations regarding actions taken to unspecified states of the world). If rules are not written with enough potential states of the world in mind, they can lead to unanticipated undesirable outcomes (e.g., a driver following the rule above is too slow to bring their passenger to the hospital in time to save their life), but to enumerate all the potentially relevant state-action pairs is excessively costly outside of the simplest environments. In practice, most legal provisions land somewhere on a spectrum between pure rule and pure standard, and legal theory can help us estimate the right location and combination of “rule-ness” and “standard-ness” when specifying new AI objectives. There are other helpful legal theory dimensions to legal provision implementation related to the rule-ness versus standard-ness axis that could further elucidate AI design, e.g., “determinacy,” “privately adaptable” (“ru

Baseline again, I strongly support not doing any research that might collect, create, or leak pandemic pathogens. Defining this study as "not gain of function" is only defensible on technicalities that make the term almost meaningless.

It's unclear to me which headline you considered misleading. Which rationalist outlet said in a headline something about 80% MORTALITY without mentioning that it's in mice?

I'm not complaining about "literally failing to mention mice", but rather "without appearing to know that the wild-type strain was 100% lethal to the ...

The study and the reaction to it by the NIH are evidence that the system is malfunctioning. Researchers are still engaging in the type of research that can produce pandemics, even if the chance of a single study doing so is likely less than 1%. The fact that this kind of research is done is scary. The fact that the wild-type strain was 100% lethal doesn't really matter for the danger posed by the strain leaking. Sometimes there's a tradeoff between lethality and infectiousness for viruses. If the headline were instead "Researchers choose a virus with 100% lethality in mice and thought it was a good idea to add the Omicron spike protein to it that likely makes it more infectious to humans", I don't know why that would scare us any less. The culture that produces this kind of apologism for dangerous research, along with claims that speaking this way about dangerous research damages the credibility of the rationalist community, is exactly where the problem is. Status-based claims of "you should not talk about X because you might lose credibility" are what people did when we talked about masks early in the pandemic and at all sorts of points where rationalists differed from the orthodox wisdom. Fighting that kind of apologism is a necessary precondition for any progress on sane pandemic prevention. Zvi didn't even mention that the researchers thought it was a good idea not to do the experiment under biosafety level 4.

Re: that mouse study... I'm disappointed to feel I can't trust rationalist headlines without also personally checking for inconvenient details in the preprint and expert commentary (here's Derek Lowe in Science, or Helen Branswell in Stat.) For example:

...
I read a lot of Derek Lowe early in the pandemic and regard him highly, but in this case I think he's wrong. Going through the comments of Lowe's post, I came across a link to this essay by a distinguished biologist, Stephen Salzberg, at Johns Hopkins agreeing with Zvi's perspective. Salzberg is a computational biologist, not a virologist, but he's a distinguished professor at a prestigious school and does not seem to be on the fringe politically as far as I can tell. If anybody knows more about him, please let me know. Overall, experts seem to be split on this matter, which is strong enough evidence for me that the research should have been disallowed or at least regulated to the highest security level. The risks are just too great relative to what was learned from the research. I have written a letter to my representative in the House encouraging her to legislate more restrictions on gain of function research and referencing the article linked above.
Strong upvote. And well, your reply certainly builds up *my* trust of the rationalist community. Zvi should aim to maintain a healthy background level of disappointed readers to build up a robust peer-review population.
Yeah, I strongly downvoted this due to its clickbaity nature and would encourage others to strong downvote it too.
What would you do differently in regards to pandemic response if you have that knowledge?
Discussing COVID gain-of-function papers is important whether or not there are 80% MORTALITY claims. If Zvi would have heard of the paper he would have written about it in either case. It's unclear to me which headline you considered misleading. Which rationalist outlet said in a headline something about 80% MORTALITY without mentioning that it's in mice?

(2) Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.

(4) If AGI learns law, it will understand how to interpret vague human directives and societal human values well enough that its actions will not cause states of the world that dramatically diverge from human preferences and societal values.

I think (2) is straightforwardly false: the lethal difficulty is not getting something that understands w...

I don't think anyone is claiming that law is "always humane" or "always just" or anything of that nature.

This post is claiming that law is imperfect, but that there is no better alternative of a synthesized source of human values than democratic law. You note that law is not distinguished from "other forms of nonfiction or for that matter novels, poetry, etc" in this context, but the most likely second best source of a synthesized source of human values would not be something like poetry -- it would be ethics. And, there are some critical distinguishing fa...

I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument. Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challenging.” Section 4.1 is therefore about value learning. (Section 4.2 is about the difficulty of getting a system to internalize those values.) Rohin Shah’s sequence on Value Learning argues that “Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.” Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020) says the following: “Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended solution, even if a poorer algorithm would not be able to find this solution and thus yield solutions that are closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.” John is suggesting one way of working on the outer alignment problem, while Zac is pointing out that inner alignment is arguably more dangerous. These are both fair points IMO. In my experience, people on this website often r

There's also the trivial point that human experts tend to specialise, so an AI system which can perform at the level of any human expert in their field will necessarily be far more capable than any particular human even without synergies between such areas of expertise or any generalisation.

Yep, this is a bigger deal than I realized last week.

Please don't describe times as "Summer 2023" for events that are not exclusive to one hemisphere! Summer is about six months offset between Australia and most of Asia, not to mention the areas that have a wet/dry seasonality rather than the four temperate seasons.

"Mid-2023" or a particular month are equally clear, and avoid the ambiguity :-)

modified. Thanks!

Musing on a piece in Communications of the ACM lately (Changing the Nature of AI Research) - I find this level of ~reframing or insistence on a mathematical perspective quite frustratingly political. ISTM that this just isn't how software or AI systems work! (at least, those which can survive outside academic papers)

Taking a step back, Four Cultures of Programming (a fantastic 75-page read) discusses hacker culture, engineering culture, managerial culture, and mathematical culture in programming. I'm so deep in hacker/engineer culture that it's hard to ...

might be useful reading:

We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy. From this we identify the current practice of values engineering: the creation of classifiers from human-created data with value-based labels. This has worked in practice for a variety of issues, but problems are addressed one at a time, and users and other stakeholders have seldom been involved. Instead, we look to AI...

I think that convincing Chinese researchers and policy-makers of the importance of the alignment problem would be very valuable, but it also risks changing the focus of race dynamics to AGI, and is therefore very risky. The last thing you want to do is leave the CCP convinced that AGI is very important but safety isn't, as happened to John Carmack! Also beware thinking you're in a race.

I think (3) is only true over timescales of 2-5 years: the thing that really matters is performance per unit cost, and if you own the manufacturer you're not paying Nvidi...

Unfortunately for us LessWrongers, it's probably rational for them to believe that a race is happening because, to put it bluntly, a China with AGI would become a world power exceeding even the US. The problem is that the US government doesn't care about the alignment problem.

Distributed training runs never manage to fully utilize nominal flops from hardware, and are easy to stuff up in other ways too, but I'd expect the chips themselves to be pretty well set out - it's obvious early in the design stage if you're going to be bottlenecked on something else.

The relevant tooling already exists: CodeGeeX is a 13B-param model which would have been SOTA two years ago, trained on Chinese hardware, using a Chinese deep learning framework.

Anyone who thinks that compute restrictions will help with x-risk should explain how they plan to convince China that this isn't just a geopolitical ploy by the US to cripple their advanced industry and related military capabilities. (despite, you know, that being the motivation for export controls to date)

Wow - if those stats are correct, the training of CodeGeeX used up to 1e24 nominal flops (2.56e14 flops/chip * 1536 chips * 2.6e6 seconds), which would put it a bit ahead of Chinchilla, although it's seemingly light on param count. But it is somewhat easier to tile a chip with fp16 units than it is to utilize them effectively, so the true useful flops may be lower. Nonetheless, that's quite surprising, impressive, and perhaps concerning.
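The arithmetic behind that estimate, as a quick check. The three CodeGeeX inputs are the figures quoted in the comment; the Chinchilla comparison is an outside assumption, using the common 6 * params * tokens rule of thumb with Chinchilla's published 70B parameters and 1.4T tokens. It is not a like-for-like comparison, since one number is nominal hardware throughput and the other is estimated model compute.

```python
# Figures quoted in the comment above.
flops_per_chip = 2.56e14   # nominal fp16 FLOP/s per chip
num_chips = 1536
train_seconds = 2.6e6      # about 30 days of wall-clock time

# Upper bound on compute: assumes every chip ran at its nominal rate throughout.
nominal_flops = flops_per_chip * num_chips * train_seconds

# Outside assumption for comparison: training compute ~ 6 * params * tokens.
chinchilla_flops = 6 * 70e9 * 1.4e12

print(f"training time:      {train_seconds / 86400:.1f} days")
print(f"CodeGeeX (nominal): {nominal_flops:.2e} FLOPs")
print(f"Chinchilla (6*N*D): {chinchilla_flops:.2e} FLOPs")
print(f"ratio:              {nominal_flops / chinchilla_flops:.2f}x")
```

Any realistic hardware utilization below 100% shrinks the ratio, which is the caveat about "true useful flops" in the comment.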
I’m open to the idea that this is good from an x-risk perspective but definitely not 100% sold on it. I agree with you that China knows this is directly aiming to cripple their advanced industry and related military capabilities. We’re entering an era of open hostilities and that’s not news to anyone on either side. I don’t think that necessarily means this is bad from an x-risk perspective.

A few claims relevant to analyzing x-risk in the context of US-China policy:

1. AGI alignment is more likely if AGI is developed by US groups rather than Chinese groups, because some influential people in the US take alignment seriously while almost nobody in China does.
2. Slowing Chinese AI progress is good because it makes the US more likely to be the first to AGI, or gives more time for the necessary alignment work to happen and spread to China.
3. US policy can actually slow Chinese AI development with export bans on compute.

I currently believe these three claims. CodeGeeX is pretty good evidence against 3, showing that Chinese compute and tooling is catching up to that of the US. But building high performance compute is a complicated supply chain with a lot of choke points, and it seems plausible that the US can slow Chinese progress by a few years. See e.g. this CSET article.

IMO the strongest argument against this policy from an x-risk perspective is that this reduces future influence over Chinese AI development by using a one-time slowdown right now. If the most critical time for slowing AI progress is in the future, this bullet will no longer be in the chamber. But I also haven’t spent much time thinking about this and would welcome better arguments.

IMO scaling laws make this actually pretty easy. The curves are smooth, the relationship between capabilities of a smaller model and a less-trained larger model seems fairly reliable, and obviously you can just use your previous or smaller model to bootstrap and/or crosscheck the new overseer. Bootstrapping trust at all is hard to ground out (humans with interpretability tools?), and I expect oversight to be difficult, but I don't expect doing it from the start of training to present any particular challenge.
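As a rough illustration of what "the curves are smooth" buys you, here is a minimal sketch that fits a pure power law to a few small-model losses and extrapolates to a larger model. Everything in it is an assumption for illustration: the functional form `loss = a * size**(-b)` omits the irreducible-loss term that published scaling laws usually include, and the data points are synthetic, generated from made-up constants `TRUE_A` and `TRUE_B` rather than taken from any real training run.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    return a * size ** (-b)


# Synthetic, made-up constants; NOT measurements from any real model.
TRUE_A, TRUE_B = 10.0, 0.07
small_sizes = [1e6, 1e7, 1e8]
small_losses = [TRUE_A * n ** (-TRUE_B) for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(a, b, 1e9)
print(f"fitted exponent b = {b:.3f}, predicted loss at 1e9 params = {predicted:.3f}")
```

Real curves are noisier than this, so in practice the interesting question is how far the fit can be trusted beyond the largest point used to fit it.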

(opinions my own, you know the drill)

I think this is missing the point pretty badly for Anthropic, and leaves out most of the work that we do. I tried writing up a similar summary, which is necessarily a little longer:

Anthropic: Let’s get as much hands-on experience building safe and aligned AI as we can, without making things worse (advancing capabilities, race dynamics, etc). We'll invest in mechanistic interpretability because solving that would be awesome, and even modest success would help us detect risks before they become disasters. We'll train near-cutting-edge models to study ho

... (read more)

Thanks, Zac! Your description adds some useful details and citations.

I'd like the one-sentence descriptions to be pretty jargon-free, and I currently don't see how the original one misses the point. 

I've made some minor edits to acknowledge that the purpose of building larger models is broader than "tackle problems", and I've also linked people to your comment where they can find some useful citations & details. (Edits in bold)

Anthropic: Let’s interpret large language models by understanding which parts of neural networks are involved with certain

... (read more)

"10x engineer" is naturally measured in dollar-value to the company (or quality+speed proxies on a well-known distribution of job tasks), as compared to the median or modal employee, so I don't think that's a good analogy. Except perhaps inasmuch as it's a deeply disputed and kinda fuzzy concept!

-1M. Y. Zuo5mo
Right, like most characteristics of human beings other than the ones subject to exact measurement, intelligence, 10x intelligence, 10x anything, etc., is deeply disputed and fuzzy compared to things like height, finger length, hair length, etc. So?

Unfortunately, the interpretable-composition hypothesis is simply wrong! Many bugs come from 'feature interactions', a term coined by Pamela Zave - and if it wasn't for that, programming would be as easy as starting from e.g. Lisp's or Forth's tiny sets of primitives.

As to a weaker form, well, yes - using a 'cleanroom' approach and investing heavily in formal verification (and testing) can get you an orders-of-magnitude lower error rate than ordinary software... at orders-of-magnitude greater cost. At more ordinarily-reasonable levels of rigor, I'd em... (read more)

Good point about validators failing silently and being more strongly vetted. Abstractly, it seems to me that once the tooling and process are figured out for one task in a narrow domain, you could reuse that stuff on other tasks in the same domain at relatively low cost. But the history of repeated similar vulnerabilities over long time ranges in narrow domains (e.g. GNU coreutils) is perhaps some evidence against that.

I agree with the first half and would add that restricting the kinds of interface has a large additional benefit on top of the interfaces themselves. If you are dealing with classes and functions and singleton module things and macros then you're far more prone to error compared to just using any single one of those things, even if they are all simple. If system validation has exponential cost with respect to confidence and system size then I think the simplicity of the primitives is perhaps, uh, the base of the exponent. And 1.01^1000 is a lot smaller than 2^1000. My main point is that this is the difference between "annoying to validate" and "we will never ever be able to validate".
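To put numbers on the base-of-the-exponent point, here is a quick sketch comparing bases of 1.01 and 2, each raised to the 1000th power. The function name `validation_cost` and the reading of the exponent as "system size" are illustrative assumptions, not a real cost model.

```python
def validation_cost(base, system_size):
    """Toy model: cost grows as base ** system_size (illustrative only)."""
    return base ** system_size


simple_primitives = validation_cost(1.01, 1000)  # a float, about 2.1e4
complex_primitives = validation_cost(2, 1000)    # an exact Python int

print(f"1.01^1000 is about {simple_primitives:,.0f}")
print(f"2^1000 has {len(str(complex_primitives))} decimal digits")
```

The first is around twenty thousand units of whatever the cost is measured in; the second is a number with over three hundred digits.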

Then what does "10x average adult human intelligence" mean?

As written, it pretty clearly implies that intelligence is a scalar quantity, such that you can get a number describing the average adult human, one describing an AI system, and observe that the latter is twice or ten times the former.

I can understand how you'd compare two systems-or-agents on metrics like "solution quality or error rate averaged over a large suite of tasks", wall-clock latency to accomplish a task, or fully-amortized or marginal cost to accomplish a task. However, deriving a card... (read more)

-2M. Y. Zuo5mo
No? It's common to see a 10x figure used in connection with many other things that does not imply that intelligence is a scalar quantity. For example, a 10x software engineer. Nobody that I know interprets this as literally '10x more intelligent' than the average software engineer, it's understood to mean, ideally, 10x more productive. More often it's understood as vastly more productive. (And even if someone explicitly writes '10x more intelligent software engineer' it doesn't mean they are 10x scalar units of intelligence more so. Just that they are noticeably more intelligent, almost certainly with diminishing returns, potentially leading to a roughly 10x productivity increase.) And since it's a common enough occupation nowadays, especially among the LW crowd, that I would presume the large majority of folks here would also interpret it similarly.

What does "10x average adult human intelligence" even mean? There are no natural units of intelligence!

1M. Y. Zuo5mo
I never implied there were natural units of intelligence? This is quite a bizarre thing to say or imply.

If SGD is approximately a Bayesian sampler, ...

I think it's worth noting that no large-scale system uses 'true' SGD; it's all AdamW and the weight decay seems like a strong part of the inductive bias. Of course "everything that works is approximately Bayesian", but the mathematics that people talk about with respect to SGD just aren't relevant to practice.
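To make the difference concrete, here is a minimal single-parameter sketch of a plain SGD step next to an AdamW-style step with decoupled weight decay. It is a simplified illustration under stated assumptions, not any framework's actual implementation: the function names are made up, the weights are plain floats, and the hyperparameter defaults are just common illustrative values.

```python
import math


def sgd_step(w, grad, lr=1e-3):
    """Plain SGD: the weight moves only in response to the gradient."""
    return w - lr * grad


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style step on a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w                # decoupled decay, independent of grad
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# With a zero gradient, SGD leaves the weight alone but AdamW still shrinks it.
w_sgd = sgd_step(1.0, grad=0.0)
w_adamw, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(w_sgd, w_adamw)
```

The zero-gradient case is the point: the decay term pulls weights toward zero regardless of the loss, which is one way to see why it could matter for inductive bias.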

(opinions my own)

Broadly agree with this comment. I'd buy something like "low path-dependence for loss, moderate-to-high for specific representations and behaviours" - see e.g.

I'd love to see some replications of Anthropic's Induction Heads paper - it's based around models small enough to train on a single machine (and reasonable budget for students!), related to cutting-edge interpretability, and has an explicit "Unexplained Curiosities" section listing weird things to investigate in future work.

For readers not focussed on interpretability, I'd note that 'scaling laws go down as well as up' - you can do relevant work even on very small models, if you design the experiment well. Two I'd love to see are a replication of BERTs of... (read more)

Scenario A has an almost 10% chance of survival; the others ~0%. To quote John Wentworth's post, which I strongly agree with:

What I like about the Godzilla analogy is that it gives a strategic intuition which much better matches the real world. When someone claims that their elaborate clever scheme will allow us to safely summon Godzilla in order to fight Mega-Godzilla, the intuitively-obviously-correct response is “THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO”.

I think you've misunderstood the lesson, and mis-generalized from your experience of manually improving results.

First, I don't believe that you could write a generally-useful program to improve translations - maybe for Korean song lyrics, yes, but investing lots of human time and knowledge in solving a specific problem is exactly the class of mistakes the bitter lesson warns against.

Second, the techniques that were useful before capable models are usually different to the techniques that are useful to 'amplify' models - for example, "Let's think step by st... (read more)

I think your response shows I understood it pretty well. I used an example that you directly admit is against what the bitter lesson tries to teach as my primary example. I also never said anything about being able to program something directly better. I pointed out that I used the things people decided to let go of so that I could improve the results massively over the current state of the machine translation for my own uses, and then implied we should do things like give language models dictionaries and information about parts of speech that it can use as a reference or starting point. We can still use things as an improvement over pure deep learning, by simply letting the machine use them as a reference. It would have to be trained to do so, of course, but that seems relatively easy. The bitter lesson is about 'scale is everything,' but AlphaGo and its follow-ups use massively less compute to get up to those levels! Their search is not an exhaustive one, but a heuristic one that requires very little compute comparatively. Heuristic searches are less general, not more. It should be noted that I only mentioned AlphaGo to show that even it wasn't a victory of scale like some people commonly seem to believe. It involved taking advantage of the fact that we know the structure of the game to give it a leg up.

I'm not sure what you mean by "running on fewer resources" - it's a 175 billion parameter model, so it has the same minimum hardware requirements as GPT-3.
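A back-of-the-envelope sketch of why parameter count pins down the minimum hardware: the weights alone have to fit in accelerator memory. The 2 bytes per parameter (fp16 weights) and the 80 GB device size below are assumptions for illustration, not figures from either model's documentation, and activations and caches are ignored.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (assumes fp16 by default)."""
    return num_params * bytes_per_param / 1e9


def min_devices(num_params, device_memory_gb, bytes_per_param=2):
    """Fewest devices whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / device_memory_gb)


PARAMS = 175e9  # parameter count quoted in the comment

print(weight_memory_gb(PARAMS))                   # 350.0
print(min_devices(PARAMS, device_memory_gb=80))   # 5
```

Two models with the same parameter count and precision get the same answer here, whatever the quality of their outputs.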

Their inference setup likely uses the same hardware per-response, but I'd guess OpenAI has much faster kernels so Meta would need more hardware to serve a given traffic rate. However, that's totally unrelated to the quality of each response.

Blenderbot is based on their largest "OPT" model, presumably fine-tuned somehow, after a training run which was explicitly imitating GPT-3 after they couldn't work ou... (read more)

It seems you are right. I thought previously that BlenderBot was supposed to be usable outside of the research environment. I read through the description and a key difference seems to be that BlenderBot actually has long-term memory about the user with whom it interacts.
  • Sure, suppose that the alignment problem is in the set of problems that a Bureaucracy Of AIs can solve. This sounds helpful because you've ~defined said bureaucracy to be safe, but I doubt it's possible to build a safe bureaucracy out of unsafe parts - and if it is, we don't know how to do so!
  • I dislike the fatalism here, and would rather celebrate direct attacks on the problem even when they don't work. For example, I'd love to see a more detailed writeup on BoAI proposals across a range of scenarios and safety assumptions :-)
The intended construction is to build a safer bureaucracy out of less safe parts/agents (or just less robustly safe ones). So they shouldn't break in most cases of running the bureaucracy, and the bureaucracy as a whole should break even less frequently. If the distillation of such a bureaucracy gives a safer part/agent than the original part, that is an iterative improvement. This doesn't need to change the game in one step, only improve the situation with each step, in a direction that is hard to formulate without resorting to the device of a bureaucracy. Otherwise this could be done with the more lightweight prompt/tuning setup, where the bureaucracy is just the prompt given to a single part/agent.