somewhere deep, something lurks




Thanks for making this! 

I'm wondering if you've spent time engaging with any of Michael Levin's work (here's a presentation he gave for the PIBBSS 2022 speaker series)? He often talks about intelligence at varying scales/levels of abstraction composing and optimizing in different spaces. He says things like "there is no hard/magic dividing line between when something is intelligent or not." I think you might find his thinking on the subject valuable.

You might also find Designing Ecosystems of Intelligence from First Principles and The Markov blankets of life: autonomy, active inference and the free energy principle + the sources they draw from relevant. 

How similar is Watanabe's formulation of free energy to Karl Friston's?
Generative models for sequential dynamics in active inference | Cognitive Neurodynamics
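
For concreteness, here is a rough sketch of the two quantities I have in mind. The notation (parameter $w$, prior $\varphi$, data $x_1, \dots, x_n$, hidden states $s$, observations $o$, approximate posterior $q$) is my own assumption rather than taken from either source.

Watanabe's free energy is the negative log marginal likelihood of the data:

$$F_n = -\log \int \varphi(w) \prod_{i=1}^{n} p(x_i \mid w)\, dw$$

Friston's variational free energy is a functional of an approximate posterior $q$:

$$F[q] = \mathbb{E}_{q(s)}\big[\log q(s) - \log p(o, s)\big] = D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] - \log p(o)$$

Read this way, $F[q]$ is an upper bound on $-\log p(o)$, with equality when $q$ is the exact posterior. If $w$ plays the role of $s$ and the dataset plays the role of $o$, the two would coincide at the exact posterior. Whether that identification is the right one is part of what I'm unsure about.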

Another potential idea:

27. A paper which does for the sharp left turn what the goal misgeneralization paper does for inner alignment (or, at least, breaking the SLT into sub-problems and making a paper for one of the sub-problems)

it does seem to be part of the situation we’re in

Maybe - I can see it being spun in two ways:

  1. The AI safety/alignment crowd was irrationally terrified of chatbots/current AI, forced everyone to pause, and then, unsurprisingly, didn't find anything scary
  2. The AI safety/alignment crowd needed time for their alignment techniques to catch up with the current models before things get dangerous in the future, and they did that

To point (1): alignment researchers aren't terrified of GPT-4 taking over the world, wouldn't agree with this characterization, and are not communicating this to others. I don't expect this is how things will be interpreted if people are being fair.

I think (2) is the realistic spin, and could go wrong reputationally (like in the examples you showed) if there's no interesting scientific alignment progress made in the pause-period. 
I don't expect there to be a lack of interesting progress, though. There's plenty of unexplored work in interpretability alone that could provide many low-hanging fruit results. This is something I naturally expect out of a young field with a huge space of unexplored empirical and theoretical questions. If there's plenty of alignment research output during that time, then I'm not sure the pause will really be seen as a failure. 

I’m in favor of interventions to try to change that aspect of our situation

Yeah, agree. I'd say one of the best ways to do this is to make the purpose of the pause clear and to define what counts as the pause being a success (e.g. significant research output).

Also, your pro-pause points seem quite important, in my opinion, and outweigh the 'reputational risks' by a lot:

  • Pro-pause: It’s “practice for later”, “policy wins beget policy wins”, etc., so it will be easier next time
  • Pro-pause: Needless to say, maybe I’m wrong and LLMs won’t plateau!

I'd honestly find it a bit surprising if the reaction to this was to ignore future coordination for AI safety with a high probability. "Pausing to catch up alignment work" doesn't seem like the kind of thing which leads the world to think "AI can never be existentially dangerous" and results in future coordination being harder. If AI keeps being more impressive than the SOTA now, I'm not really sure risk concerns will easily go away.

For me, the balance of considerations is that pause in scaling up LLMs will probably lead to more algorithmic progress

I'd consider this to be one of the more convincing reasons to be hesitant about a pause (as opposed to the 'crying wolf' argument, which seems to me like a dangerous way to think about coordinating on AI safety?). 

I don't have a good model for how much serious effort is currently going into algorithmic progress, so I can't say anything confidently there - but I would guess there's plenty and it's just not talked about? 

It might be a question about which of the following two you think will most likely result in a dangerous new paradigm faster (assuming LLMs aren't the dangerous paradigm):

  1. current amount of effort put into algorithmic progress + amplified by code assistants, apps, tools, research-assistants, etc.
  2. counterfactual amount of effort put into algorithmic progress if a pause happens on scaling

I think I'm leaning towards (1) bringing about a dangerous new paradigm faster because

  • I don't think the counterfactual amount of effort on algorithmic progress will be that much more significant than the current efforts (pretty uncertain on this, though)
  • I'm wary of adding faster feedback loops to technological progress/allowing avenues for meta-optimizations to humanity since these can compound

I had a potential disagreement with your claim that a pause is probably counterproductive if there's a paradigm change required to reach AGI: even if the algorithms of the current paradigm aren't directly a part of the algorithm behind existentially dangerous AGI, advances in these algorithms will massively speed up research and progress towards this goal.

My take is: a “pause” in training unprecedentedly large ML models is probably good if TAI will look like (A-B), maybe good if TAI will look like (C), and probably counterproductive if TAI will be outside (C).


The obvious follow-up question is: “OK then how do we intervene to slow down algorithmic progress towards TAI?” The most important thing IMO is to keep TAI-relevant algorithmic insights and tooling out of the public domain (arxiv, github, NeurIPS, etc.).

This seems slightly contradictory to me. Whether or not the current paradigm results in TAI, it is certainly going to make it easier to code, communicate ideas, write, and research faster, which could make all sorts of progress, including algorithmic insights, come sooner. A pause to the current paradigm would thus be neutral or net positive if it ensures progress isn't sped up by more powerful models.

We don't need to solve all of philosophy and morality, it would be sufficient to have the AI system to leave us in control and respect our preferences where they are clear

I agree that we don't need to solve philosophy/morality if we could at least pin down things like corrigibility. But humans may understand "leaving humans in control" and "respecting human preferences" poorly enough that optimizing for human abstractions of these concepts could be unsafe. This belief isn't that strongly held; I'm just considering some exotic scenarios where humans are technically 'in control' according to the specification we thought of, but the consequences are negative nonetheless (the normal Goodharting failure mode).

Which of the two (or the innumerable other possibilities) happens?

Depending on the work you're asking the AI(s) to do (e.g. automating large parts of open-ended software projects, automating large portions of STEM work), I'd say the world takeover/power-seeking/recursive self-improvement type of scenarios happen, since these tasks incentivize the development of unbounded behaviors. Open-ended, project-based work doesn't have clear deadlines, may require multiple retries, and has lots of uncertainty, so I can imagine unbounded behaviors like "gain more resources because that's broadly useful under uncertainty" being strongly selected for.

I'd be interested in hearing more about what Rohin means when he says:

... it’s really just “we notice when they do bad stuff and the easiest way for gradient descent to deal with this is for the AI system to be motivated to do good stuff”.

This sounds something like gradient descent retargeting the search for you because it's the simplest thing to do when there are already existing abstractions for the "good stuff" (e.g. if there already exists a crisp abstraction for something like 'helpfulness', and we punish unhelpful behaviors, it could potentially be 'really easy' for gradient descent to simply use the existing crisp abstraction of 'helpfulness' to do much better at the tasks we give it). 

I think this might be plausible, but a problem I anticipate is that the abstractions for things we "actually want" don't match the learned abstractions that end up being formed in future models and you face what's essentially a classic outer alignment failure (see Leo Gao's 'Alignment Stream of Thought' post on human abstractions). I see this happening for two reasons:

  1. Our understanding of what we actually want is poor, such that we wouldn't want to optimize for how we understand what we want
  2. We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn't want to optimize for the expression of what we want 

I've used the same terms (horizontal and vertical generality) to refer to what I think are different concepts from the ones discussed here, but wanted to share my versions of these terms in case there are any parallels you see.

Horizontal generality: An intelligence's ability to take knowledge/information learned from an observation/experience solving a problem and use it to solve other similarly-structured/isomorphic problems (e.g. a human notices that a problem in finding optimal routing can be essentially mapped to a graph theory problem and solving one solves the other) 
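
A toy sketch of the routing example, under my own assumptions: the function name `shortest_path_cost` and both example graphs are made up for illustration. The point is only that once two problems are recognized as the same graph problem, one solver covers both.

```python
import heapq

def shortest_path_cost(graph, start, goal):
    """Dijkstra's algorithm. `graph` maps node -> list of (neighbor, cost).

    Returns the cheapest total cost from start to goal, or None if
    the goal cannot be reached.
    """
    best = {start: 0}
    frontier = [(0, start)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return None

# Problem 1 (hypothetical): delivery routing, costs are travel times.
roads = {
    "depot": [("a", 4), ("b", 1)],
    "b": [("a", 2)],
    "a": [("customer", 5)],
}

# Problem 2 (hypothetical): cheapest chain of file-format conversions.
# It looks unrelated, but it maps onto the same graph structure.
formats = {
    "docx": [("html", 2), ("pdf", 7)],
    "html": [("md", 1), ("pdf", 3)],
    "md": [("pdf", 3)],
}

print(shortest_path_cost(roads, "depot", "customer"))
print(shortest_path_cost(formats, "docx", "pdf"))
```

Nothing in the solver knows about roads or file formats; the "horizontal" step is noticing that both problems fit the same shape.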

Vertical generality: An intelligence's ability to use its existing knowledge to augment its own intelligence with tools or by successfully designing smarter agents aligned to it (e.g. a human is struggling with solving problems in quantum mechanics, and no amount of effort is helping them. They find an alternative route to solving these problems by learning how to create aligned superintelligence which helps them solve the problems)


If you're an intelligence solving problems, increasing horizontal generality helps because it lets you see how problems you've already solved actually apply to problems you didn't think they applied to before you increased horizontal generality. Increasing vertical generality helps because it finds an alternative route to solving the problem by actually increasing your effective problem solving ability.
