All of Wei_Dai's Comments + Replies


I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)

Also, given that I think AI risk is very high for human civilization, and there being no reason to suspect that we're not a typical pre-AGI civiliz...

I think the acausal economy would look aggressively space expansionist/resource-exploitative (those are the ones that will acquire and therefore control the most resources; others will self-select out or be out-competed) and, if you're pessimistic about alignment, with some Goodharted human(-like) values from failed alignment (and possibly some bad human-like values). The Goodharting may go disproportionately in directions that are more resource-efficient and allow faster resource acquisition and use and successful takeover (against their creators and other AI). We may want to cooperate most with those using their resources disproportionately for artificial minds or for which there's the least opportunity cost to do so (say because they're focusing on building more hardware that could support digital minds).

To your first question, I'm not sure which particular "the reason" would be most helpful to convey. (To contrast: what's "the reason" that physically dispersed human societies have laws? Answer: there's a confluence of reasons.) However, I'll try to point out some things that might be helpful to attend to.

First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects.  For instance, will you merge utility functions multiplicatively (as in Nas...

What does merging utility functions look like, and are you sure it's not going to look the same as global free trade? It's arguable that trade is just a way of breaking down and modularizing a big multifaceted problem over a lot of subagent task specialists (and there's no avoiding having subagents, due to the light speed limit).
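The additive-vs-multiplicative distinction raised above can be made concrete with a toy sketch (my own illustration, not from the thread — the agents, utilities, and numbers are all made up):

```python
# Toy illustration: two ways to "merge" two agents' utility functions
# over a shared set of outcomes.

def u_alice(outcome):
    return outcome["alice_share"]           # risk-neutral in her share

def u_bob(outcome):
    return 2 * outcome["bob_share"] ** 0.5  # risk-averse (concave) in his

# Candidate ways of splitting 10 units of resource:
outcomes = [{"alice_share": a, "bob_share": 10 - a} for a in range(11)]

# Additive merge (weighted sum, Harsanyi-style):
best_additive = max(outcomes, key=lambda o: u_alice(o) + u_bob(o))

# Multiplicative merge (Nash-bargaining-style product, with
# disagreement utilities normalized to zero):
best_nash = max(outcomes, key=lambda o: u_alice(o) * u_bob(o))

print(best_additive)  # the sum-maximizing split
print(best_nash)      # the product-maximizing split
```

The two rules pick different splits here (the product rule gives more to the risk-averse party), which is one way the choice of merge procedure is itself a boundary-setting negotiation rather than a neutral technicality.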

That’s the path the world seems to be on at the moment. It might end well and it might not, but it seems like we are on track for a heck of a roll of the dice.

I agree with almost everything you've written in this post, but you must have some additional inside information about how the world got to this state, having been on the board of OpenAI for several years, and presumably knowing many key decision makers. Presumably this wasn't the path you hoped that OpenAI would lead the world onto when you decided to get involved? Maybe you can't share specific ...


We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

What? A major reason we're in the current mess is that we don't know how to do this. For example we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (race for AI to make a competitor 'dance')? Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g....

Well, we are not very good at it, but generally speaking, however much capitalism seems to be acting to degrade our food, food companies are not knowingly routinely putting poisonous additives in food. And however bad medicine is, it does seem to be a net positive these days. Both of these things are a big improvement on Victorian times! So maybe we are a tiny bit better at it than we used to be? Not convinced it actually helps, mind...

Looking forward to your next post, but in the meantime:

  1. AI - Seems like it would be easier to build an AI that helps me get what I want, if "what I want" had various nice properties and I wasn't in “crossing that bridge when we come to it” mode all the time.
  2. meta-ethical uncertainty - I can't be sure there is no territory.
  3. ethics/philosophy as a status game - I can't get status from this game if I opt out of it.
  4. morality as coordination - I'm motivated to make my morality have various nice properties because it helps other people coordinate with me (by letting them better predict what I would do in various situations/counterfactuals).

My first thought upon hearing about Microsoft deploying a GPT derivative was (as I told a few others in private chat) "I guess they must have fixed the 'making up facts' problem." My thinking was that a big corporation like Microsoft that mostly sells to businesses would want to maintain a reputation for only deploying reliable products. I honestly don't know how to adjust my model of the world to account for whatever happened here... except to be generically more pessimistic?

Answer by Wei_Dai, Feb 15, 2023

But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.

There are a couple of ways to extend the argument:

  1. Having a utility function (or some other stable explicit representation of values) is a likely eventual outcome of recursive self-improvement, since it makes you less vulnerable to value drift and manipulation, and makes coordination easier.
  2. Even without utility functions, AIs can try to merge, i.e., negotiate and jointly build s
...

Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.

Let's distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it's not reflectively stable and the proponents haven't talked about how they plan to ensure that things will go well in the long run. If you're talking about the former and I'm ...

I think of shard theory as more than just a model of how to model humans. My main point here is that human values will be represented in AIs in a form that looks a good deal more like the shard theory model than like a utility function. Approaches that involve utility functions seem likely to make alignment harder, via adding an extra step (translating a utility function into shard form) and/or by confusing people about how to recognize human values. I'm unclear whether shard theory tells us much about how to cause AIs to have the values we want them to have. Also, I'm not talking much about the long run. I expect that problems with reflective stability will be handled by entities that have more knowledge and intelligence than we have.

PBR-A, EGY, BTU, ARCH, AMR, SMR.AX, YAL.AX (probably not a good time to enter this last one) (Not investment advice, etc.)

My personal view is that given all of this history and the fact that this forum is named the "AI Alignment Forum", we should not redefine "AI Alignment" to mean the same thing as "Intent Alignment". I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion is too great, avoiding "AI alignment" and using some other term like "AI x-safety". (Although there's an issue with "existential risk/safety" as well, because "existential risk/safety" covers problems that aren't literally existential, e.g., where humanity survives but its future potential is greatly curtailed. Man, coordination is hard.)

I don't think this is the main or only source of confusion:

* MIRI folks also frequently used the narrower usage. I think the first time I saw "aligned" was in Aligning Superintelligence with Human Interests from 2014 (scraped by wayback on January 3 2015), which says "We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”"
* Virtually every problem people discussed as part of AI alignment was also part of intent alignment. The name was deliberately chosen to evoke "pointing" your AI in a direction. Even in the linked post Eliezer uses "pointing the AI in the right direction" as a synonym for alignment.
* It was proposed to me as a replacement for the narrower term AI control, which quite obviously doesn't include all the broader stuff. In the email thread where Rob suggested I adopt it he suggested it was referring to what Nick Bostrom called the "second principal-agent problem" between AI developers and the AI they build.

I want to emphasize again that this definition seems extremely bad. A lot of people think their work helps AI actually produce good outcomes in the world when run, so pretty much everyone would think their work counts as alignment. It includes all work in AI ethics, if in fact that research is helpful for ensuring that future AI has a good outcome. It also includes everything people work on in AI capabilities, if in fact capability increases improve the probability that a future AI system produces good outcomes when run. It's not even restricted to safety, since it includes realizing more upside from your AI. It includes changing the way you build AI to help address distributional issues, if the speaker (very reasonably!) thinks those

Other relevant paragraphs from the Arbital post:

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences wit

...

Here is some clearer evidence that broader usages of "AI alignment" were common from the beginning:

  1. In this Arbital page dated 2015, Eliezer wrote:

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

(I couldn't find an easy way to view the original 2015 version, but do have a screenshot that I can produce upon request showing a Jan 2017 edit on Arbital that already had this broad def...

In the 2017 post Vladimir Slepnev is talking about your AI system having particular goals, isn't that the narrow usage? Why are you citing this here?
I misread the date on the Arbital page (since Arbital itself doesn't have timestamps and it wasn't indexed by the Wayback machine until late 2017) and agree that usage is prior to mine.
Other relevant paragraphs from the Arbital post:

Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to I think).

Eliezer used "AI alignment" as early as 2016 and wasn't registered until 2017. Any other usage of the term that potentially predates Eliezer?

But that talk appears to use the narrower meaning though, not the crazy broad one from the later Arbital page. Looking at the transcript:

* The first usage is "At the point where we say, “OK, this robot’s utility function is misaligned with our utility function. How do we fix that in a way that it doesn’t just break again later?” we are doing AI alignment theory." Which seems like it's really about the goal the agent is pursuing.
* The subproblems are all about agents having the right goals. And it continuously talks about pointing agents in the right direction when talking informally about what alignment is.
* It doesn't talk about how there are other parts of alignment that Eliezer just doesn't care about. It really feels like "alignment" is supposed to be understood to mean getting your AI to be not trying to kill you / trying to help you / something about its goals.
* The talk doesn't have any definitions to disabuse you of this apparent implication.

What part of this talk makes it seem clear that alignment is about the broader thing rather than about making an AI that's not actively trying to kill you?

I’m not sure what order the history happened in and whether “AI Existential Safety” got rebranded into “AI Alignment” (my impression is that AI Alignment was first used to mean existential safety, and maybe this was a bad term, but it wasn’t a rebrand)

There was a pretty extensive discussion about this between Paul Christiano and me. tl;dr "AI Alignment" clearly had a broader (but not very precise) meaning than "How to get AI systems to try to do what we want" when it first came into use. Paul later used "AI Alignment" for his narrower meaning, but after...

I don't think I really agree with this summary. Your main justification was that Eliezer used the term with an extremely broad definition on Arbital, but the Arbital page was written way after a bunch of other usage (including after me moving to I think). I think very few people at the time would have argued that e.g. "getting your AI to be better at politics so it doesn't accidentally start a war" is value alignment, though it obviously fits under Eliezer's definition. (ETA: actually the Arbital page is old, it just wasn't indexed by the wayback machine and doesn't come with a date on Arbital itself, so I agree with the point that this post is evidence for an earlier very broad usage.) I would agree with "some people used it more broadly" but not "clearly had a broader meaning." Unless "broader meaning" is just "used very vaguely such that there was no agreement about what it means." (I don't think this really matters except for the periodic post complaining about linguistic drift.)

UDT still has utility functions, even though it doesn't have independence... Is it just a terminological issue? Like you want to call the representation of value in whatever the correct decision theory turns out to be something besides "utility"? If so, why?

But where does UDT get those utility functions from, why does it care about expected utility specifically and not arbitrary preference over policies? Utility functions seem to centrally originate from updateful agents, which take many actions in many hypothetical situations, coherent with each other, forcing preference to be describable as expected utility. Such agents can then become reflectively stable by turning to UDT, now only ever taking a single decision about policy, in the single situation of total ignorance, with nothing else for it to be coherent with. So by becoming updateless, a UDT agent loses contact with the origin of (motivation for) its own utility function. To keep it, it would still implicitly need an updateful point of view, with its many situations that constitute the affordance for acting coherently, to motivate its preference to have the specific form of expected utility. Otherwise it only has the one situation, and its preference and policy could be anything, with no opportunity to be constrained by coherence.
Scott Garrabrant (2mo):
I think UDT as you specified it has utility functions. What do you mean by doesn't have independence? I am advocating for an updateless agent model that might strictly prefer a mixture between outcomes A and B to either A or B deterministically. I think an agent model with this property should not be described as having a "utility." Maybe I am conflating "utility" with expected utility maximization/VNM and you are meaning something more general?  If you mean by utility something more general than utility as used in EUM, then I think it is mostly a terminological issue.  I think I endorse the word "utility" without any qualifiers as referring to EUM. In part because I think that is how it is used, and in part because EUM is nice enough to deserve the word utility.
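For reference, the independence axiom being debated here is the standard VNM axiom (stated from textbook decision theory, not from the thread itself): mixing a third lottery into both sides of a comparison cannot reverse the preference.

```latex
% VNM independence: for lotteries $A$, $B$, $C$ and any $p \in (0,1]$,
\[
A \succeq B \iff pA + (1-p)C \;\succeq\; pB + (1-p)C .
\]
```

An agent that strictly prefers a 50/50 mixture of outcomes A and B to both A and B individually (the agent model Scott advocates above) violates this axiom, which is why dropping independence means giving up "utility" in the expected-utility-maximization sense.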

I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models.

Interesting. I'm curious what kinds of results you're hoping for, or just more details about your project. (But feel free to ignore this if talking about it now isn't a good use of your time.) My understanding is that LLMs can potentially be fine-tuned or used to do various things, including instantiating various human-like or artificial characters (such as a "helpf...

I don't see any clear-cut disagreement between my position and your White-Box Metaphilosophical AI. I wonder how much is just a framing difference? Reflective stability seems like something that can be left to a smarter-than-human aligned AI. I'm not saying it would be bad to implement a utility function in an AGI. I'm mostly saying that aiming for that makes human values look complex and hard to observe. E.g. it leads people to versions of the diamond alignment problem that sound simple, but which cause people to worry about hard problems which they mistakenly imagine are on a path to implementing human values. Whereas shard theory seems aimed at a model of human values that's both accurate and conceptually simple.

I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.

"Humans don't have utility functions" and the idea of values being embedded in heuristics and/or subagents have been discussed since the beginning of LW, but usually framed as a problem, not a solution. The key issue here is that utility functions are the only reflectively stable carrier of value that we know of, meaning that an agent with a utility function would want to preserve that utility function and build more agents with the sam...

That's not necessarily a good thing. One man's reflective stability is another man's incorrigibility.
Some people on LW have claimed that reflective stability is essential. My impression is that Robin Hanson always rejected that. This seems like an important clash of intuitions. It seems to be Eliezer claiming that his utility function required it, and Robin denying that is a normal part of human values. I suspect this disagreement stems from some important disagreement about human values. My position seems closer to Robin's than to Eliezer's. I want my values to become increasingly stable. I consider it ok for my values to change moderately as I get closer to creating a CEV. My desire for stability isn't sufficiently strong compared to other values that I need guarantees about it.
This point seems anti-memetic for some reason. See my 4 karma answer here, on a post with 99 karma.
Scott Garrabrant (2mo):
I feel like reflective stability is what caused me to reject utility. Specifically, it seems like it is impossible to be reflectively stable if I am the kind of mind that would follow the style of argument given for the independence axiom. It seems like there is a conflict between reflective stability and Bayesian updating. I am choosing reflective stability, in spite of the fact that losing updating is making things very messy and confusing (especially in the logical setting), because reflective stability is that important. When I lose updating, the independence axiom goes, and thus utility goes along with it.
Quintin Pope (2mo):
I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models. I don't think that the utility function framework helps much.

I do agree that utility functions seem poorly suited to capturing human values. However, my reaction to that is to be more skeptical of utility functions. Also, I don't think the true solution to questions of reflectivity is to reach some perfected fixed point, after which your values remain static for all time. That doesn't seem human, and I'm in favor of 'cloning the human prior' as much as possible. It seems like a very bad idea to deliberately set out to create an agent whose mechanism of forming and holding values is completely different from our own.

I also think utility functions are poorly suited to capturing the behaviors and values of current AI systems, which I take as another downwards update on the applicability of utility functions to capture such things in powerful cognitive systems. We can take a deep model and derive mathematical objects with properties that are nominally very close to utility functions. E.g., for transformers, it will probably soon be possible to find a reasonably concise energy function (probably of a similar OOM of complexity as the model weights themselves) whose minimization corresponds to executing forwards passes of the transformer. However, this energy function wouldn't tell you much about the values or objectives of the model, since the energy function is expressed in the ontology of model weights and activations, not an agent's beliefs / goals. I think this may be the closest we'll get to static, universally optimized for utility functions of realistic cognitive architectures. (And they're not e

I wrote about these when they first came out and have been wearing them since then, with good results. It looks like Amazon isn't selling them directly anymore (only via third-party sellers) but Northern Safety has them for $0.59 each, and sometimes for $0.19 each on sale. (The brand is different but it's the same mask.)

I think that multi-decision-influence networks seem much easier to align and much safer for humans.

It seems fine to me that you think this. As I wrote in a previous post, "Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore."

As a further meta point, I think there's a pattern where because many existing (somewhat) conc...

I think these are great points, thanks for leaving this comment. I myself patiently await the possible day where I hit an obvious shard theory landmine which has the classic alignment-difficulty "feel" to it. That day can totally come, and I want to be ready to recognize if it does.

FWIW I'm not intending to advocate "shard theory or GTFO", and agree that would be bad as a community policy. I've tried to mention a few times[1] (but perhaps insufficiently prominently) that I'm less excited about people going "oh yeah I guess shard theory is great or something, let's just think about that now" and more excited about reactions like "Oh, I guess I should have been practicing more constant vigilance, time to think about alignment deeply on my own terms, setting aside established wisdom for the moment." I'm excited about other people thinking about alignment from first principles and coming up with their own inside views, with their own theories and current-best end-to-end pictures of AGI training runs.

1. ^ From Inner and outer alignment decompose one hard problem into two extremely hard problems. I've also made this point briefly at the end of in-person talks. Maybe I should say it more often.

I think that baggage is actually doing work in some people’s reasoning and intuitions.

Do you have any examples of this?

Scott Garrabrant (3mo):
Hmm, examples are hard. Maybe the intuitions contribute to the concept of edge instantiation?

That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don’t actually think the danger directly translates. And I think it’s unlikely that multi-objective optimisers would not care about humans or other agents.

I think one possible form of existential catastrophe is that human values get only a small share of the universe, and as a result the "utility" of the universe is much smaller than it could be. I worry this will happen if only one or few of the objectives of multi obj...

I think that multi-decision-influence networks seem much easier to align and much safer for humans.

1. Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want."
2. All else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
3. I don't currently understand what it means for the agent to have to care in "exactly the right way." I find myself wanting to point to Alignment allows "nonrobust" decision-influences, but I know you've already read that post, so...
   1. Perhaps I'm worried about implicit equivocation between
      1. "the agent's values have to be adversarially robust so that they make decisions in precisely the right way, such that the plans which most appeal to the decision-maker are human-aligned plans" (I think this is wrong, as I argue in that post)
      2. "There may be a lot of sensitivity of outcome-goodness_human to the way the human-aligned shards influence decisions, such that the agent has to learn lots of 'niceness parameters' in order to end up treating us well" (Seems plausible)
   2. I think (i) is doomed and unnecessary and not how realistic agents work, and I think (ii) might be very hard. But these are really not the same

Speaking for myself, I sometimes use "EU maximization" as shorthand for one of the following concepts, depending on context:

  1. The eventual intellectual descendant of EU maximization, i.e., the decision theory or theory of rationality that future philosophers will eventually accept as correct or ideal or normative, which presumably will have some kind of connection (even if only historical) to EU maximization.
  2. The eventual decision procedure of a reflectively stable superintelligence.
  3. The decision procedure of a very capable consequentialist AI, even if it'
...
Scott Garrabrant (3mo):
I note that EU maximization has this baggage of never strictly preferring a lottery over outcomes to the component outcomes, and your steelmen appear to me to not carry that baggage. I think that baggage is actually doing work in some people's reasoning and intuitions.
I parsed the Rob Bensinger tweet I linked in the OP as being about expected utility maximising when I read it, but others have pointed out that wasn't necessarily a fair reading.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What...

Yes. A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.

I think this would also be illegal if justified by the AI company's preferences rather than customer preferences, and it would at least make them a salient political target for people who disagree. It might be OK if they were competing to attract employees/investors/institutional customers, and in practice I think it would most likely happen as a move by the dominant faction in political/cultural conflict in a broader society, and this would be a consideration raising the importance of AI researchers and potentially capitalists in that conflict.

I agree if you are someone who stands to lose from that conflict then you may be annoyed by some near-term applications of alignment, but I still think (i) alignment is distinct from those applications even if it facilitates them, (ii) if you don't like how AI empowers your political opponents then I strongly think you should push back on AI development itself rather than hoping that no one can control AI.

build and provide clean and effectively unlimited energy

How? The closest thing I can think of is nuclear fission but calling it "clean" seems controversial to say the least. Nuclear fusion seems a long way away from being economically viable. If you're talking about solar and wind, I think there are good arguments against calling it "effectively unlimited", or "clean" for that matter. See for a lecture about this.

I am referring to, among other things, humanity’s unfortunate retreat from space exploration.

C...

Solar is effectively unlimited. Yes, our storage capacity is not yet sufficiently developed to carry solar energy to the darker & colder periods, but that will get developed & installed as solar gets more and more competitive with other energy sources - which it rapidly is becoming year by year. More energy comes to the Earth from sunlight every year than we have received from fossil fuels throughout the entirety of human history.
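The closing claim is easy to sanity-check with rough public figures (the numbers below are order-of-magnitude assumptions of mine, not from the comment):

```python
# Back-of-envelope check of "more energy arrives from sunlight per year
# than humanity has ever extracted from fossil fuels."
# Assumed rough figures: solar power intercepted by Earth ~1.73e17 W;
# cumulative historical fossil energy use ~2e22 J (a few hundred EJ/yr
# sustained over roughly two centuries).

SECONDS_PER_YEAR = 3.15e7
solar_input_per_year_J = 1.73e17 * SECONDS_PER_YEAR  # ~5.5e24 J
cumulative_fossil_J = 2e22                           # assumed total

ratio = solar_input_per_year_J / cumulative_fossil_J
print(f"One year of sunlight is roughly {ratio:.0f}x all historical fossil energy")
```

Under these assumptions the claim holds by a wide margin (a couple of orders of magnitude), so it is robust to considerable error in the fossil-energy estimate.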
I think that I would consider people's access to food "effectively unlimited" - the binding constraint is how much you have any reason to want, not cost. We are short of that for energy not just because energy is dirty, but because it's a limiting cost factor for many people's travel, etc. So I would claim that we don't need costs to come down by more than an order of magnitude to get there, and that's plausible with current technology - traditional nuclear, solar electric, solar thermal, geothermal, etc. (And there is a key issue with making the available clean energy usable in transportation - but it seems unrelated to the eventual limits.)

To answer these in the wrong order: yes, I agree that there are many other goals we can and should pursue, but excepting a few areas where we don't know how to get what we want - AI Safety being the most critical example - I think we could saturate them in terms of funding and manpower without dedicating more than a moderate fraction of humanity's resources.

But I view space exploration as both an investment into a prosperous longer term future, and a signpost. I'm not claiming that it has a near-term very large return, just that it's very clearly net positive. Given that, it's bizarre that we'd stop doing it, especially given how cheap government capital is and has been. (And see the comment here for more on that point.)

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried abo... (read more)

I don't really think that's the case.  Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on. In practice I think it's even worse than that; people get politically worked up about things that don't affect their lives at all through a variety of channels. I do agree that backlash to X will be bigger if all AIs do X than if some AIs do X.

I don't think this scenario is really relevant to the most common concerns about concentration of power. I think the most important reason to be scared of concentration of power is:

  • Historically you need a lot of human labor to get things done.
  • With AI the value of human labor may fall radically.
  • So capitalists may get all the profit, and it may be possible to run an oppressive state without a bunch of humans.
  • This may greatly increase economic inequality or make it much more possible to have robust oppressive regimes.

But all of those arguments are unrelated to the number of AI developers. Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI). I don't think it's really plausible to have a technical situation where AI can be used to pursue "humanity's overall values" but cannot be used to pursue the values of a subset of humanity. (I also tend to think that technocratic solutions to empower humanity via the design of AI are worse than solutions that empower people in more legible ways, either by having their AI a

This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.

I'm curious how you're trying to reach such an audience, and what their reactions have been.

(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers. Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on which are the major high-level points I'm hanging my hat on when deciding what to do). I think there's also potential for this work to reach highly motivated but nontechnical people who are better at communication and distribution than I am (and have seen some of this happening). I have the impression that these posts are pretty widely read in the EA community and at some AI labs, and have raised understanding and concern about misalignment to some degree.  I may explore more aggressive promotion in the future, but I'm not doing so now.

Let us assume that, on average, a booster given to a random person knocks you on your ass for a day. That’s one hundred years, an actual lifetime, of knocked-on-ass time for every hospitalization prevented. The torture here seems less bad than the dust specks.
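To make the arithmetic behind the quoted claim explicit — a quick back-of-envelope check in Python, using only the quote's own (unsourced) assumptions:

```python
# Sanity check of the quoted estimate: "one hundred years of knocked-on-ass
# time for every hospitalization prevented". Both inputs are the quote's own
# illustrative assumptions, not sourced figures.
down_days_per_booster = 1
boosters_per_hospitalization_prevented = 36_500  # 100 years * 365 days

total_down_days = down_days_per_booster * boosters_per_hospitalization_prevented
years = total_down_days / 365
print(f"{years:.0f} years of down time per hospitalization prevented")  # 100
```

In other words, the claim implicitly assumes boosters prevent roughly one hospitalization per ~36,500 doses; the conclusion is only as good as that ratio.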

What's your source for "booster given to a random person knocks you on your ass for a day"? None of my family had more than a sore arm.

For the more severe consequences, see also, which is one of the replies to the tweet you linked. (Don't have... (read more)

I am unconcerned enough about hospitalization that even if the RNA shots gave me anything but a sore arm that wakes me up the night after the shot when I roll over onto it, I would be analyzing it in terms of days of feeling cruddy from shot versus days of feeling cruddy from an infection, and timing them accordingly to maximize the odds of preventing feeling like crud for a week in a given year.
I know of a couple of people in my community who complained of this, but the rate I've observed is maybe an order of magnitude lower than what Zvi is suggesting.
Lao Mein · 4mo
Sinovac, at least, gave a low-grade fever to everyone I knew who got it. There was an unspoken agreement in my workplace that anyone who took the vaccine could take the afternoon off for exactly this reason. Probably varies a lot from person to person.
[comment deleted] · 4mo

calls out Bostrom as out of touch

I think he actually said that Bostrom represents the current zeitgeist, which is kind of the opposite of "out of touch"? (Unless he also said "out of touch"? Unfortunately I can't find a transcript to do a search on.)

It's ironic that everyone thinks of themselves as David fighting Goliath. We think we're fighting unfathomably powerful economic forces (i.e., Moloch) trying to build AGI at any cost, and Peter thinks he's fighting a dominant culture that remorselessly smothers any tech progress.


Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

I'm not sure I agree that this is unfair.

OpenAI is clearly on the cutting edge of AI research.

This is obviously a good reason to focus on them more.

OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.

Perhaps we have responsibility to scrutinize/criticize them more because of this... (read more)


I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:

  1. metaphilosophy
    • How to solve new philosophical problems relevant to alignment as they come up?
    • How to help users when they ask the AI to attempt philosophical progress?
    • How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or intentionally optimized by other AIs/agents to manipulate the use
... (read more)

To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?

It worries me to depend on AI to do philosophy, without understanding what "philosophical reasoning" or "philosophical p... (read more)

Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. It is much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify. However, I'm skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will actually be that long. For example, a lot of problems in rationality + decision theory + game theory I'd count more as model capabilities, and the moral patienthood questions you can punt on for a while from the longtermist point of view.
What are the key philosophical problems you believe we need to solve for alignment?

If anyone here has attended a top university, was the effort to get in worth it to you (in retrospect)?

Ben Pace · 4mo
I think so, yes? I found one of my great collaborators in life there, and I don't think I would have found a collaborator of a similar calibre at other universities. It was also a place with opportunity for me to build a great conference as a side-project, which depended on the prestige of the place and the fact that a few key orgs were based around that uni. (I went to Oxford Uni; FHI and CEA were there.) I am not a hard worker in general, and didn't work that hard to get there; having a certain level of base IQ was pretty key for me. Obviously everything the university had me do (or 'taught me') was a waste of time, but it was a Schelling coordination place for smart and talented and prestige-seeking people.  Added: Not going to uni would also have been a very poor choice for me; I really damn care about immigration access to the US, and a technical degree is basically critical for me to have that.
No. I would wake up early to do homework, spend 8 to 10 hours at school (and often do homework during class), and then go home and do more homework. Looking back I got a lot more value out of my non-school activities and hobbies than I did out of doing homework, but there was just so little time for anything else. I was constantly stressed about missing deadlines and usually extremely tired. Meanwhile when my husband was in high school he did zero homework and actually did things in the world like building a house and starting a startup, which seems way better to me. And then my top university was a miserable place where I learned nothing and didn't even make any useful connections, and just suffered for four years while my parents paid hundreds of thousands of dollars for it, and THEN it turned out that ~all of the jobs I've ever had, including my first job out of college, I got based on performing well on work tests, which I could absolutely still have done if I'd gone to some small liberal arts school. No one has even asked to see my resumé since 2019 and all of my work has been completely unrelated to my degree. tl;dr school is a scam don't fucking go don't do it stay the fuck away
I'm not sure whether this counts: I went to the University of Cambridge, which is (especially for mathematics, which is what I studied there) probably the best university in the UK, with a worldwide reputation, but unlike top US universities it doesn't charge enormous fees. And I didn't put any particular effort into having a vast range of impressive extracurricular activities, I was just very good at mathematics. So the costs for me were very different from those for someone attending, say, Harvard. (Also, this was 30+ years ago, when even Harvard was much cheaper than now.) Anyway: yes, definitely worth it for me. I went there to learn a lot of mathematics, and that's what I did; some of the people who taught me were world-class mathematicians (though it will probably surprise no one to learn that being a world-class mathematician does not imply being a world-class teacher of mathematics); a lot of the people around me were super-smart or super-interesting or both. (Also, and not to be neglected, I made some good friends there and did a bunch of other enjoyable things, but that would probably have been much the same at any other university.) And, though it wasn't particularly why I chose to go there, I'm pretty sure it was good for my career -- though for obvious reasons it's hard to disentangle.
Yes. Some of that was luck, but I think the number of things you'd have to change before my limited strategy ability was better applied outside college than in was very high. But I do wish that long list of things had been different.
Not exactly your question, but I did poorly in a state university, and dropped out after two years.  I figure it cost me a decade of support and ancillary jobs before I fully self-educated in CS fundamentals and started my career as a professional software developer.   I've since been quite involved in hiring for large tech companies, and it's clear that a degree from a top-20 school is far better than from a random one, which is better than no degree, in terms of likelihood of being taken seriously as a junior applicant.  It takes 5-10 years of employment before it becomes irrelevant. Questions of "worth the effort" are very hard to answer, because it's impossible to know the counterfactuals.  On a purely financial and comfort level, it'd have been worth the effort for me to have done better in school, and actually graduated, preferably from a better school.  But that's mostly because it's easy to discount the cost to past-me of doing so.  In fact, I didn't and maybe couldn't.

For me the answer is yes, but my situation is quite non-central. I got into MIT since I was a kid from a small rural town with really good grades, really good test scores, and was on a bunch of sports teams. Because I was from a small rural town and was pretty smart, none of this required special effort other than being on sports teams (note: being on the teams required no special skill as everyone who tried out made the team given small class size). The above was enough to get me an admission probably for reasons of diversity I'm a white man but I'm fairl... (read more)

Thanks for these detailed explanations. Would it be fair to boil it down to: DL currently isn't very sample efficient (relative to humans) and there's a lot more data available for training generative capabilities than for training to self-censor and to not make stuff up? Assuming yes, my next questions are:

  1. How much more training data (or other effort/resources) do you think would be needed to solve these immediate problems (at least to a commercially acceptable level)? 2x? 10x? 100x?
  2. I'm tempted to generalize from these examples that unless something ma
... (read more)
I would wildly speculate that "simply" scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment []), would be plenty to get current models to express calibrated uncertainty well. However:

  • In practice, I think we'll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like "tricks" (e.g. teaching the model to generally express uncertainty when answering hard math problems) and some more principled (e.g. automating parts of the evaluation).
  • Even ~100x is still much less than pre-training (e.g. WebGPT used ~20k binary comparisons, compared to ~300b pre-training tokens for GPT-3). The difficulty of course is that higher-quality data is more expensive to collect. However, most of the cost of RLHF is currently employee hours and compute, so scaling up data collection ~100x might not be as expensive as it sounds (although it would of course be a challenge to maintain data quality at this scale).
  • Even though scaling up data collection will help, I think it's more important for labs to be prioritizing data quality (i.e. "reducing bias" rather than "reducing variance"): data quality issues are in some sense "scarier" in the long run, since they lead to the model systematically doing the wrong thing (e.g. deceiving the evaluators) rather than defaulting to the "safer" imitative pre-training behavior.
  • It's pretty unclear how this picture will evolve over time. In the long run, we may end up needing much less extremely high-quality data, since larger pre-trained models are more sample efficient, and we may get better at using techniques like automating parts of the evaluation. I've written mo
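For context on what those ~20k binary comparisons are used for — a minimal, toy sketch of the pairwise-comparison objective (Bradley-Terry loss) that RLHF reward models are typically trained with. The scalar "reward model" and the three data points are illustrative stand-ins, not anyone's actual setup:

```python
import math

def pairwise_loss(w: float, x_chosen: float, x_rejected: float) -> float:
    # -log sigmoid(r_chosen - r_rejected), with r = w * x as a toy linear reward model
    margin = w * (x_chosen - x_rejected)
    return math.log(1 + math.exp(-margin))

# Toy data: a scalar feature for each (chosen, rejected) response pair.
comparisons = [(1.0, 0.2), (0.8, 0.5), (0.9, 0.1)]
w, lr = 0.0, 0.5
loss_before = sum(pairwise_loss(w, c, r) for c, r in comparisons)
for _ in range(200):
    for x_c, x_r in comparisons:
        p = 1 / (1 + math.exp(-w * (x_c - x_r)))  # P(chosen beats rejected)
        w += lr * (1 - p) * (x_c - x_r)           # ascend the log-likelihood
loss_after = sum(pairwise_loss(w, c, r) for c, r in comparisons)
assert loss_after < loss_before  # the reward model now ranks chosen > rejected
```

Scaling this up ~100x means collecting ~100x more such comparison labels, which is why labeler cost and label quality dominate the discussion above.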

If I train a human to self-censor certain subjects, I'm pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.

So I think you're very likely right about adding patches being easier than unlearning capabilities, but what confuses me is ... (read more)
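The "classifier subcircuit" picture from the comment above can be caricatured in a few lines — the capability is untouched, and a separate scorer gates the output. The keyword scorer here is a hypothetical stand-in for a learned classifier:

```python
# Caricature of "patch" self-censorship: generation is intact; a separate
# classifier gates outputs. FORBIDDEN and the scorer are toy stand-ins.
FORBIDDEN = {"hotwire", "meth"}

def classifier_score(text: str) -> float:
    """Stand-in for a learned 'is this a forbidden subject?' classifier."""
    words = set(text.lower().split())
    return len(words & FORBIDDEN) / len(FORBIDDEN)

def generate(prompt: str) -> str:
    candidate = f"Here is how to {prompt}"      # base capability still exists
    if classifier_score(candidate) > 0.0:       # the patch only gates outputs
        return "I can't help with that."
    return candidate

print(generate("hotwire a car"))  # gated by the patch
print(generate("bake bread"))     # passes through
```

This also makes the Nearest Unblocked Neighbor failure mode concrete: rephrasings the classifier doesn't recognize (poetry about hotwiring, say) sail straight through, because the underlying capability was never removed.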

Why do you say that it doesn't work as well? Or more specifically, why do you imply that humans are good at it? Humans are horrible at keeping secrets, suppressing urges or memories, etc., and we don't face nearly the rapid and aggressive attempts to break it that we're currently doing with ChatGPT and other LLMs.
Eliezer Yudkowsky · 4mo
I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch was in.  In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in.  Good old Nearest Unblocked Neighbor.

We do have empirical evidence that nonrobust aligned intelligence can be not OK, like this or this. Why are you not more worried about superintelligent versions of these (i.e. with access to galaxies worth of resources)?

What do you mean by "nonrobust" aligned intelligence? Is "robust" being used in the "robust grading" sense, or in the "robust values" sense (of e.g. agents caring about lots of things, only some of which are human-aligned), or some other sense?  Anyways, responding to the vibe of your comment -- I feel... quite worried about that? Is there something I wrote which gave the impression otherwise? Maybe the vibe of the post is "alignment admits way more dof than you may have thought" which can suggest I believe "alignment is easy with high probability"?

Any thoughts why it's taking so long to solve these problems (reliably censoring certain subjects, avoiding hallucinations / making up answers)? Naively these problems don't seem so hard that I would have expected them to remain largely unsolved after several years while being very prominent and embarrassing for labs like OpenAI.

Also, given that hallucinations are a well know problem, why didn't OpenAI train ChatGPT to reliably say that it can sometimes make up answers, as opposed to often denying that? ("As a language model, I do not have the ability to m... (read more)

If you censor subjects without context, the AI becomes massively crippled, and will fail at things you want it to do. Let's take the example where someone told ChatGPT they owned a factory of chemicals, and were concerned about people breaking in to make meth, and hence wondering which chemicals they should particularly guard to prevent this. It is obvious to us as readers that this is a hack for getting meth recipes. But ChatGPT performs theory of mind at a level below a human nine year old; humans are fiendishly good at deception. So it falls for it.

Now, you could stop such behaviour by making sure it does not talk about anything related to chemicals you can use to make meth, or opioids, or explosives, or poisons. But at this point, you have also made it useless for things like law enforcement, counter-terrorism, writing crime novels, supporting chemistry students, recommending pharmaceutical treatments, and securing buildings against meth addicts; like, related stuff is actually done, e.g. cashiers are briefed on combinations of items, or items purchased in large quantities, which they need to flag, report and stop because they are drug ingredients.

Another problem is that teaching it what it should not do involves giving it explicit information. E.g. it is very well and beautifully designed to counsel you against bullying people. As such, it knows what bullying looks like. And if you ask it what behaviours you should crack down on to prevent bullying... you get a guide for how to bully.

Anything that just blindly blocks unethical advice based on keywords blocks a lot of useful advice. As a human, you have the ability to discuss anything, but you are judging who you are talking to and the context of the question when you weigh your answer, which is a very advanced skill, because it depends on theory of mind, in humans at least. It is like the classic dilemma of updating to a better security system to imprison people; more sophisticated systems often come with more vuln

Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises:  Training capabilities out of neural networks is asymmetrically harder than training them into the network.

Or put with some added burdensome detail but more concretely visualizable:  To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably... (read more)

In addition to reasons other commenters have given, I think that architecturally it's a bit hard to avoid hallucinating. The model often thinks in a way that is analogous to asking itself a question and then seeing what answer pops into its head; during pretraining there is no reason for the behavior to depend on the level of confidence in that answer, you basically just want to do a logistic regression (since that's the architecturally easiest thing to say, and you have literally 0 incentive to say "I don't know" if you don't know!) , and so the model may... (read more)

My understanding of why it's especially hard to stop the model making stuff up (while not saying "I don't know" too often), compared to other alignment failures:

  • The model inherits a strong tendency to make stuff up from the pre-training objective.
  • This tendency is reinforced by the supervised fine-tuning phase, if there are examples of answers containing information that the model doesn't know. (However, this can be avoided to some extent, by having the supervised fine-tuning data depend on what the model seems to know, a technique that was employed here.)
  • I
... (read more)
Adam Jermyn · 4mo
Roughly, I think it’s hard to construct a reward signal that makes models answer questions when they know the answers and say they don’t know when they don’t know. Doing that requires that you are always able to tell what the correct answer is during training, and that’s expensive to do. (Though e.g. Anthropic seems to have made some progress here: [].)
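One way to see why the reward design is tricky: under any fixed penalty scheme, the model's best policy reduces to a confidence threshold, and computing the reward at all requires knowing the true answer during training. A toy rendering (the payoff numbers are arbitrary):

```python
# Toy abstention reward: +1 for a correct answer, -2 for a wrong one,
# 0 for "I don't know". Payoff values are arbitrary illustrations.
R_CORRECT, R_WRONG, R_ABSTAIN = 1.0, -2.0, 0.0

def expected_reward(p_correct: float) -> float:
    return p_correct * R_CORRECT + (1 - p_correct) * R_WRONG

def best_policy(p_correct: float) -> str:
    # Answering beats abstaining only above the break-even confidence:
    # p*1 + (1-p)*(-2) > 0  =>  p > 2/3
    return "answer" if expected_reward(p_correct) > R_ABSTAIN else "abstain"

assert best_policy(0.9) == "answer"
assert best_policy(0.5) == "abstain"
```

The hard part is not this arithmetic but producing the correctness signal: grading each training answer as right or wrong is exactly the expensive step the comment points at.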

Not to put too fine a point on it, but you're just wrong that these are easy problems. NLP is hard because language is remarkably complex. NLP is also hard because it feels so easy from the inside -- I can easily tell what that pronoun refers to, goes the thinking, so it should be easy for the computer! But it's not, fully understanding language is very plausibly AI-complete.

Even topic classification (which is what you need to reliably censor certain subjects), though it seems simple, has literal decades of research and is not all that close to being solved.

So I think you should update much more towards "NLP is much harder than I thought" rather than "OpenAI should be embarrassed at how crappy their NLP is".

ChatGPT has a tendency to make up answers and refuse to admit it. A log follows. Note that almost every paragraph of the initial biography contains multiple falsehoods that are unlikely to be directly in its training data.

Show me a 10 paragraph biography of Wei Dai

Wei Dai was born in Guangzhou, China in 1967. She received her undergraduate degree in computer science from the University of Science and Technology of China in 1988 and her master's degree in computer science from the University of California, Berkeley in 1991.

After completing her master's ... (read more)

After market close on 10/26/2022, Meta guided an increase in annual capex of ~$4B (from 32-33 for 2022 to 34-39 for 2023), "with our investment in AI driving all of that growth". NVDA shot up 4% afterhours on this news. (Before you get too alarmed, I read somewhere that most of that is going towards running ML on videos, which is apparently very computationally expensive, in order to improve recommendations, in order to compete with TikTok. But one could imagine all that hardware being repurposed for something else down the line. Plus, maybe it's not a gre... (read more)

How much do you expect Meta to make progress on cutting edge systems towards AGI vs. focusing on product-improving models like recommendation systems that don’t necessarily advance the danger of agentic, generally intelligent AI? My impression earlier this year was that several important people had left FAIR, and then FAIR and all other AI research groups were subsumed into product teams. See []. I thought this would mean deprioritizing fundamental research breakthroughs and focusing instead on less cutting edge improvements to their advertising or recommendation or content moderation systems. But Meta AI has made plenty of important research contributions since then: Diplomacy, their video generator, open sourcing OPT and their scientific knowledge bot. Their rate of research progress doesn’t seem to be slowing, and might even be increasing. How do you expect Meta to prioritize fundamental research vs. product going forwards?

This reminds me of an example I described in this SL4 post:

After suggesting in a previous post [1] that AIs who want to cooperate with each other may find it more efficient to merge than to trade, I realized that voluntary mergers do not necessarily preserve Bayesian rationality, that is, rationality as defined by standard decision theory. In other words, two "rational" AIs may find themselves in a situation where they won't voluntarily merge into a "rational" AI, but can agree to merge into an "irrational" one. This seems to suggest that we shouldn't expec

... (read more)
I remember my reaction when first reading this was "both AIs delegate their power, then a jointly trusted coinflip is made, then a new AI is constructed which maximizes one of the utility functions". That seems to solve the problem in general.
Why does the probability need to be close to 50% for the AI to agree to the merger? Shouldn't its threshold for agreeing to the merger depend on how likely one or the other AI is to beat the other in a war for the accessible universe? Is there an assumption that the two AIs are roughly equally powerful, and that a both-lose scenario is relatively unlikely?
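The coinflip-merger intuition, and the question of how the threshold should depend on war odds, can be put in toy expected-utility terms (all payoffs illustrative):

```python
# Toy model of the "jointly trusted coinflip" merger: each AI compares a
# lottery over whose utility function the merged AI maximizes against a
# costly war over the accessible universe. Payoffs are illustrative.
U_WIN, U_LOSE = 1.0, 0.0
WAR_COST = 0.2                     # resources destroyed by fighting

def ev_war(p_win: float) -> float:
    # winner's prize is reduced by the cost; the loser is floored at 0 here
    return p_win * (U_WIN - WAR_COST)

def ev_coinflip(p_chosen: float) -> float:
    # merged AI maximizes this agent's utility with probability p_chosen
    return p_chosen * U_WIN + (1 - p_chosen) * U_LOSE

# For evenly matched AIs, a 50/50 coinflip beats fighting:
assert ev_coinflip(0.5) > ev_war(0.5)
# A much stronger AI (90% to win a war) would demand more than a 50/50 flip:
assert ev_war(0.9) > ev_coinflip(0.5)
```

So, as the comment above suggests, the fair coinflip probability should track each side's prospects in a fight (plus the deadweight cost of fighting), rather than sitting at 50% by default.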

Philosophers don’t discuss things which can be falsified.

Sometimes in life, one simply faces questions whose answers can't be falsified, such as "What should we do about things which can't be falsified?" If you're proposing to avoid discussing them, well aren't you discussing one of them now? And why should we trust you, without discussing it ourselves?

I think you had the bad luck of taking a couple of philosophy classes that taught things that were outdated or "insane". (Socrates and Aristotle may have been very confused, but consider, how did we, i.e.... (read more)

This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:

  • The evaluation process is some c
... (read more)

Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”

The ... (read more)

I wrote in the post: Sorry if I'm just repeating something you read and understood, but I do feel like this criterion answers "no, this is still not grader-optimization; the effective search over lots of plans is still a side-effect of your cognition, not the terminal end."  In particular, note that the strategy you described would not strongly want to be given the actual-highest-rated plan--or maybe it would want to know more about the plan as a curiosity, but not in order to evaluate and execute that plan. That's one way in which saying "your strategy is not grader-optimization" constrains my anticipations in a useful-seeming way. This is a good point. I'm wondering about the type of the (presumably Cartesian) interface between the CEV-sim and the actor. First, CEV-sim shouldn't be affectable by the input-plan unless and until they run some stats on it. Otherwise the actor could (maybe?) side-channel attack them via whatever computer registers the input-plan shows up in. And CEV-sim does have to infer what they're being used for, at each invocation of the grader (since they don't retain memory across counterfactuals). a. That aside, if CEV-sim can just syntactically check whether the input-plan runs that, then your argument seems good.  b. If CEV-sim has to understand the actor's latent state context (is it all in the plan?), in order to make sure that the purported X-running plan isn't just running dangerous Y in another programming language... Seems like they can't do this.  I feel like we're dealing with (b) more than (a), so I'd say "no, 2 is safer than 1" tentatively.
Is your issue here that there exists a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex's claim (which strikes me more as saying "there are a lot more cases of 2. being safe than of 1.")? 
Vivek Hebbar · 4mo
Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

  • The agent's allegiance is to some idealized utility function Uideal (like CEV).  The agent's internal evaluator Eval is "trying" to approximate Uideal by reasoning heuristically.  So now we ask Eval to evaluate the plan "do argmax w.r.t. Eval over a bunch of plans".  Eval reasons that, due to the way that Eval works, there should exist "adversarial examples" that score very highly on Eval but low on Uideal.  Hence, Eval concludes that Uideal(plan) is low, where plan = "do argmax w.r.t. Eval".  So the agent doesn't execute the plan "search widely and argmax".
  • "Improving Eval" makes sense because Eval will gladly replace itself with Eval2 if it believes that Eval2 is a better approximation for Uideal (and hence replacing itself will cause the outcome to score better on Uideal).

Are there other distinct frameworks which make sense here?  I look forward to seeing what design Alex proposes for "value child".
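The framework's key claim — that Eval can coherently predict "argmax over a wide search will find my own adversarial examples" — is easy to render deterministically (toy utility function and a deliberately planted evaluator bug; all numbers illustrative):

```python
# Toy Goodhart demo: Eval mostly tracks U_ideal but has one exploitable
# error region. A wider search finds the exploit; a narrow one doesn't.
def u_ideal(plan: float) -> float:
    return -abs(plan)              # true value: plans near 0 are best

def eval_heuristic(plan: float) -> float:
    if plan > 2.99:                # planted "adversarial example" region
        return 10.0
    return u_ideal(plan)

light_search = [-2.0, -1.0, 0.0, 1.0, 2.0]
heavy_search = [i / 100 for i in range(-300, 301)]

best_light = max(light_search, key=eval_heuristic)   # picks 0.0
best_heavy = max(heavy_search, key=eval_heuristic)   # picks 3.0, the exploit

assert u_ideal(best_light) > u_ideal(best_heavy)     # wider search Goodharts
```

An Eval that knows it contains regions like this can therefore assign low Uideal to the plan "argmax over everything Eval rates highly", which is exactly the move the framework above describes.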

so making choices which drop the odds of success so drastically

I wouldn't say "drastically" here so maybe this is the crux. I think the chances of success if China does make an all out push for semiconductors is very low given its own resources and likely US and allies' responses (e.g. they could collectively way outspend China on their own subsidies). I could express this as <1% chance of having a world leading semi fab 10 years from now and <5% chance 20 years from now, no matter what China chooses to do at this point. If hegemony was the only g... (read more)

I agree with Rob Bensinger's response here, plus it's just a really weird use of "insane", like saying that Japan would have been insane not to attack Pearl Harbor after the US imposed an oil embargo on them, because "You miss 100% of the shots you don’t take." Thinking that way only makes sense if becoming a world or regional hegemon was your one and only goal, but how did that become the standard for sanity of world leaders around here?

It's not a weird use. It is a completely normal one: becoming the hegemon is the avowed goal, and so making choices which drop the odds of success so drastically raises questions about the thought process, which in this case I find much better explained by simply mistaken beliefs and desires on Xi's part combined with ordinary sanity. I have already explained at length why I think those beliefs are mistaken, and I also think the desires are bad: do I think China (in the sense of its population as a whole) is better off for Xi's powerhunger, or that China would be better off for launching a scaling Manhattan Project? Certainly not: in my opinion, Xi's reign has made the Chinese people substantially worse off than they would have been under a more status-quo leader, and Xi has probably single-handedly curtailed their long-term growth prospects & condemned them to permanent middle-income status as they begin to enter a Japan-style malaise, in addition to authoritarian disasters like the Uighurs. But many Chinese believe otherwise now, and endorse further questing for hegemony, and they & Xi are in charge, not you, and desire what they desire, not what you desire - you (and Rob) are projecting your own preferences and moralizing here, instead of trying to understand what is going on.

I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

Maybe (I don't mean read this book, which I haven't either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also the Wikipedia article Heaven's_Gate_(religious_group).

> However, this does not seem important for my

... (read more)

> From the scaling-pilled perspective, or even just centrist AI perspective, this is an insane position: it is taking an L on one of, if not the most, important future technological capabilities, which in the long run may win or lose wars.

Are you suggesting that the sane policy is for Xi to dump in as much subsidies as needed until China catches up in semiconductors with the US and its allies? I haven't seen anyone else argue this, and it seems implausible to me, given that the latter collectively has much greater financial and scientific/engineering resou... (read more)

Separately from gwern's argument, I say that maintaining the gap is still of vital national interest. As an example, one of the arguments in favor of nuclear testing bans is that it unilaterally favors American nuclear supremacy, because only the US has the computational resources to conduct simulations good enough to be used in engineering new weapons. That logic was applied to Russia, but the same logic applies to China: advanced simulations are useful for almost every dimension of military competition. If they let advanced compute go, that means that the US will be multiple qualitative generations ahead in terms of our ability to simulate, predict, and test-without-risk. This is a terrible position to be in, geopolitically.

> Are you suggesting that the sane policy is for Xi to dump in as much subsidies as needed until China catches up in semiconductors with the US and its allies? I haven't seen anyone else argue this

Yes. And perhaps no one else does because they aren't scaling proponents. But from a scaling perspective, accepting a permanent straitjacket around GPUs & a tightened noose is tantamount to admitting defeat & abandoning the future to other countries; it'd be like expelling all your Jewish scientists in 1935 & banning the mining of uranium. It's not t... (read more)

I wonder if, given the COVID and real-estate crises, Xi's government just doesn't have the financial resources to bail out the chips industry, plus maybe they (correctly?) understand that the likelihood of building an internationally competitive chips industry is poor (given the sanctions) even if they do dump in another $200b?

Also, it seems like China is being less antagonistic towards Taiwan and other countries in the last few days. Together with the lack of chips bailout, maybe it means they've realized that it was too early to "go loud" and are pivoting... (read more)

Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I'm sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.

> The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.

I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate atte... (read more)

I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

However, this does not seem important for my (intended) original point. Namely, if you're trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer's curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of "utility function over observation/universe histories."

(Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)

Answered now.

This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders' pseudocode. At best, that wastes computation. At (probable) worst, the system blows up. By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
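The "high-strength optimizer's curse" point can be seen in a small simulation. This is a toy sketch under assumed conditions (plan values and grader errors both standard normal; none of these names come from the original comments): selecting the argmax by an imperfect grader's score systematically picks plans whose score overstates their true value.

```python
import random

random.seed(0)

# Each toy "plan" bundles a hidden true value with a fixed grader error;
# the grader only ever sees their sum.
plans = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

def true_value(plan):
    return plan[0]

def grader_score(plan):
    # Imperfect evaluator: truth plus independent noise.
    return plan[0] + plan[1]

# Hard argmax over the grader's scores: the winner is selected as much
# for a large (favorable) error as for a large true value.
best = max(plans, key=grader_score)
print(f"grader score of selected plan: {grader_score(best):.2f}")
print(f"true value of selected plan:   {true_value(best):.2f}")
```

The gap between the two printed numbers is the curse; it grows with the number of plans searched, which is why a brute-force-search maximizer ends up evaluating (and favoring) exactly the plans its grader is most wrong about.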

> It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren't argmaxing. You're using resources effectively.

But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing wh... (read more)

I was primarily critiquing "argmax over the whole plan space." I do caution that I think it's extremely important to not round off "iterative, reflective planning and reasoning" as "restricted argmax", because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.

No, I mean: don't design agents which are motivated to find and exploit adversarial inputs. Don't align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn't align an agent to care about cows and then be surprised that it didn't care about diamonds. Why be surprised here?