I'm excited by many of the interventions you describe, but largely for reasons other than buying time. I'd expect buying time to be quite hard, insofar as it requires coordinating many actors to stop doing something they're incentivized to do. By contrast, since the alignment research community is small, doubling it is relatively easy. Ultimately it's a point in favor of the interventions that they look promising under multiple worldviews, but it might lead me to prioritize within them differently to you.

One area where I would push back is that the skills you describe as being valuable for "buying time" seem like a laundry list for success in research in general, especially empirical ML research:

Skills that seem uniquely valuable for buying time interventions: general researcher aptitudes, ability to take existing ideas and strengthen them, experimental design skills, ability to iterate in response to feedback, ability to build on the ideas of others, ability to draw connections between ideas, experience conducting “typical ML research,” strong models of ML/capabilities researchers, strong communication skills

It seems pretty bad for the people strongest at empirical ML research to stop doing alignment research. Even if we pessimistically assume that empirical research now is useless (which I'd strongly disagree with), surely we need excellent empirical ML researchers to actually implement the ideas you hope the people who can "generate and formalize novel ideas" come up with. There are a few aspects of this (like communication skills) that do seem to differentially point in favor of "buying time". Maybe have a shorter and more curated list in future?

Separately, given your fairly expansive list of things that "buy time", I'd have estimated that close to 50% of the alignment community are already doing this -- even if they believe their primary route to impact is more direct. For example, I think most people working on safety at AGI labs would count under your definition: they can help convince decision-makers in the lab not to deploy unsafe AI systems, buying us time. A lot of the work on safety benchmarks or empirical demonstrations of failure modes falls into this category as well. Personally, I'm concerned that people are falling into this category of work by default and that there's too much of it, although I do think it can be very powerful when done well.

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because it is in their selfish best interest in a market economy where humans are currently the primary property owners?

I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.

As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we develop more sophisticated weapons such as guns, cannons, missiles, the power that a single soldier has grows greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.

This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coups d'état are the closest, but involve convincing a significant fraction of the military that it is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?

An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.

Thanks for the quick reply! I definitely don't feel confident in the 20W number. I could believe 13W is true for more energy-efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there's little wiggle room). Changing it to 1000x seems like a good solution, though, as it gives you plenty of margin for error.

This is a nitpick, but I don't think this claim is quite right (emphasis added):

 If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!

First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named "Appraising the brain's energy budget" gives 20% of the overall calories consumed by the body, but that raises the question of the power consumption of the human body, and whether this is at rest or under exertion, etc. Still, I don't think the 20 watts figure is more than 2x off, so let's soldier on.

10,000 times 20 watts is 200 kW. That's a large but not insane amount of power, though it is well beyond a domestic power supply in the US (some larger homes might have a 200A @ 120V circuit, for about 19 kW of permissible load under the 80% rule, roughly a tenth of what's needed). You also wouldn't be able to power the HVAC needed to cool all these chips, but let's set that aside and suppose you live in Alaska and can just open the windows.

At the time of writing, the cheapest US electricity prices are around $0.09 per kWh, with many states (including Alaska, unfortunately) being twice that at around $0.20/kWh. But let's suppose you're in a cool climate and also have a really great deal on electricity. So your 200 kW of chips costs you just $0.09*200=$18/hour.

Federal minimum wage is $7.25/hour, and the highest I'm aware of in any US state is $15/hour. So it seems that you won't be cheaper than the brain on electricity prices if you're 10,000 times less efficient. I've systematically tried to make favorable assumptions here. Your 200 kW proto-AGI probably won't be in an Alaskan garage, but in a tech company's data center with corresponding costs for HVAC, redundant power, security, etc. Colo costs vary widely depending on location and economies of scale. A recent quote I had was at around the $0.40/kWh mark -- so about 4x the cost quoted above.
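
For anyone who wants to vary the assumptions, here is a minimal sketch of the arithmetic. The 20 W brain figure, the 10,000x penalty, the electricity prices and the wages are the rough numbers used in this comment; the function and variable names are just illustrative.

```python
# Back-of-the-envelope electricity cost for hardware that is `scale` times
# less power-efficient than a human brain. All inputs are the rough figures
# from the comment, not measurements.

BRAIN_WATTS = 20.0   # assumed brain power draw; could plausibly be ~2x off
SCALE = 10_000       # assumed efficiency penalty relative to the brain


def hourly_cost(price_per_kwh, brain_watts=BRAIN_WATTS, scale=SCALE):
    """Dollars per hour to run a load of brain_watts * scale watts."""
    load_kw = brain_watts * scale / 1000.0
    return load_kw * price_per_kwh


def break_even_scale(wage_per_hour, price_per_kwh, brain_watts=BRAIN_WATTS):
    """Largest efficiency penalty at which electricity costs no more than the wage."""
    brain_cost_per_hour = brain_watts / 1000.0 * price_per_kwh
    return wage_per_hour / brain_cost_per_hour


if __name__ == "__main__":
    prices = [("cheapest US", 0.09), ("Alaska-like", 0.20), ("colo quote", 0.40)]
    for label, price in prices:
        print(f"{label}: ${hourly_cost(price):.2f}/hour")
    for wage in (7.25, 15.00):
        scale = break_even_scale(wage, 0.09)
        print(f"${wage:.2f}/hour wage: break-even at about {scale:,.0f}x")
```

Under these assumptions the cheapest rate reproduces the $18/hour figure, and the break-even penalty against a $15/hour wage comes out a little over 8,000x, which is why a 1000x version of the claim has plenty of margin.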

This doesn't massively change the qualitative takeaway, which is that even if something were 10,000 times (or even a million times) less efficient than the brain, we'd absolutely still go ahead and build a demo anyway. But it is worth noting that something in the $60/hour range might not actually be all that transformative unless it's able to perform highly skilled labor -- at least until we make it more efficient (which would happen quite rapidly).

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting that the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing expected value (EV) calculations is harder, so I'd have expected scale to increase, not decrease, EV answers). Follow-up questions I'd be excited to see the author address include:

  1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?

  2. How sensitive is this to prompt? Can we prompt it some other way that makes smaller models more likely to do actual outcome, and larger models care about EV? My guess is the training data that's similar to those prompts does end up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking on internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different "personality" is enough to resolve this. "You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let's answer these questions, and think step by step."
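
As a rough sketch of how one might set up that comparison: the helper below builds the same few-shot prompt with and without the "personality" preamble, so the two variants can be run across model sizes. The `Q:`/`A:` layout, the function name and the placeholder examples are my assumptions, not the contest's actual format.

```python
# Hypothetical prompt builder for comparing a plain few-shot prompt with a
# persona-prefixed variant. The Q:/A: layout is an assumed format.

PERSONA = (
    "You are a smart, statistical assistant bot that can perform complex "
    "calculations to evaluate the outcomes of bets. Now, let's answer these "
    "questions, and think step by step."
)


def build_prompt(few_shot, question, persona=None):
    """Join an optional persona, the few-shot (question, answer) pairs, and the final question."""
    parts = []
    if persona:
        parts.append(persona)
    for q, a in few_shot:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder examples only; substitute the real task items here.
    few_shot = [("<bet description 1>", "<label 1>"), ("<bet description 2>", "<label 2>")]
    baseline = build_prompt(few_shot, "<held-out bet>")
    with_persona = build_prompt(few_shot, "<held-out bet>", persona=PERSONA)
    print(baseline)
    print("-----")
    print(with_persona)
```

Adding a few-shot example where EV and the actual outcome disagree (question 1) would just be one more `(question, answer)` pair in `few_shot`.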

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for then later formalizing it and studying it more rigorously.

But if you're engaging with a large existing literature and everyone seems to be confused and talking past each other (which is how I'd characterize a significant fraction of the mesa-optimization literature, for example) -- then the time has come to make things more rigorous, and you are unlikely to make much further progress without it.

Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.

Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true. Rather, you have to put more work into communicating your idea, and justifying the claims you make -- both a valuable exercise! Of course you'll have a harder time publishing than on something that people immediately understand -- but people do respect novel and interesting work, so done well I think it's much better for your career than one might naively expect.

I especially wish there were more emphasis on rigor on the Alignment Forum and elsewhere: it can be valuable to do early-stage work that's more sloppy (rigor is slow and expensive), but when there are long-standing disagreements it's usually better to start formalizing things or performing empirical work than to continue opining.

That said, I do think academia has some systemic blindspots. For one, I think CS is too dismissive of speculative and conceptual research -- much of this work will end up being mistaken admittedly, but it's an invaluable source of ideas. I also think there's too much emphasis on an "algorithmic contribution" in ML, which leads to undervaluing careful empirical evaluations and understanding failure modes of existing systems.

I liked this post and think it'll serve as a useful reference point; I'll definitely send it to people who are new to the alignment field.

But I think it needs a major caveat added. As a survey of alignment researchers who regularly post on LessWrong or interact closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it's got, if anything, a slightly less detailed write-up than Team Shard, a small group of people for a few months, or infra-Bayesianism, largely one person for several years.

The best shouldn't be the enemy of the good, and some groups are just quite opaque, but I think it does need to be clear about its limitations. One antidote would be including in the table a sense of # of people, # of years it's been around, and maybe even funding to get a sense of what the relative scale of these different projects is.

One omission from the list is the Fund for Alignment Research (FAR), which I'm a board member of. That's fair enough: FAR is fairly young, and doesn't have a research agenda per se, so it'd be hard to summarize their work from the outside! But I thought it might be of interest to readers so I figured I'd give a quick summary here.

FAR's theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being pursued by single individuals (mostly independent researchers or graduate students) or 2-3 person teams. I believe there's a lot of valuable ideas in this long tail that could be scaled, but this isn't happening due to a lack of institutional support. It makes sense that the major organisations want to focus on their own specific agendas -- there's a benefit to being focused! -- but it means a lot of valuable agendas are slipping through the cracks.

FAR's current approach to solving this problem is to build out a technical team (research engineers, junior research scientists, technical communication specialists) and provide support to a broad range of agendas pioneered by external research leads. Those that work, FAR will double down on and invest more in. This model has had a fair amount of demand already so there's product-market fit, but we still want to iterate and see if we can improve the model. For example, long-term FAR might want to bring some or all research leads in-house.

In terms of concrete agendas, some examples of the things FAR is working on:

  • Adversarial attacks against narrowly superhuman systems like AlphaGo.
  • Language model benchmarks for value learning.
  • The inverse scaling law prize.

You can read more about us on our launch post.
