All of berglund's Comments + Replies

The ELK report has a section called "Indirect normativity: defining a utility function" that sketches out a proposal for using ELK to help align AI. Here's an excerpt:

Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?

The natural approach is to ask our AI “How good are the consequences of action A?” but t

...

Predicate: given a program, does it halt?

Generation problem: generate a program which halts.

Verification problem: given a program, verify that it halts.

In this example, I'm not sure that generation is easier than verification in the most relevant sense. 

In AI alignment, we have to verify that some instance of a class satisfies some predicate. E.g. is this neural network aligned, is this plan safe, etc. But we don't have to come up with an algorithm that verifies if an arbitrary example satisfies some predicate. 

The above example is only relevant ...
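To make the halting example above concrete, here is a minimal sketch. It assumes programs are modeled as Python generators that yield once per step; the names `generate_halting_program`, `halts_within`, and `loops_forever` are illustrative and not from the original discussion. Generating one halting program is trivial, and a specific instance can be confirmed to halt by running it under a step budget, even though no general verifier for arbitrary programs exists.

```python
from typing import Callable, Iterator, Optional

Program = Callable[[], Iterator[None]]


def generate_halting_program() -> Program:
    """Generation problem: produce *some* program that halts (easy)."""
    def program() -> Iterator[None]:
        for _ in range(3):
            yield  # one step of computation
    return program


def loops_forever() -> Iterator[None]:
    """A program that never halts, for contrast."""
    while True:
        yield


def halts_within(program: Program, max_steps: int) -> Optional[bool]:
    """Bounded check of one specific instance.

    Returns True if the program halts within max_steps steps.
    Returns None (unknown) otherwise: running longer might still
    show that it halts, so this is not a general verifier.
    """
    steps = program()
    for _ in range(max_steps):
        try:
            next(steps)
        except StopIteration:
            return True
    return None
```

The `None` result is the important design choice: a step budget can confirm halting for a particular instance, but it can never confirm non-halting.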

It does clarify what you are talking about, thank you.

Now it's your use of "intolerable" that I don't like. I think most people could kick a coffee addiction if they were given enough incentive, so withdrawal is not strictly intolerable. If every feeling that people take actions to avoid is "intolerable", then the word loses a lot of its meaning. I think "unpleasant" is a better word. (Also, the reason people get addicted to caffeine in the first place isn't the withdrawal, but more that it alleviates tiredness, which is even less "intolerable.")

Your phras...

I struggled at first to see the analogy being made to AI here. In case it helps others, here is my interpretation:

  • Near-future (or current?) LLMs are the planes here, humans are the birds.
  • These LLMs will soon be able to perform many of the most important cognitive functions that humans can do. In the analogy, these are the flying-related functions that planes perform.
  • As with current LLMs, there will always be certain tasks that humans are better at, such as motor control or humor. That's because humans are highly specialized for certain tasks that aren't a
...

For instance, ~1 billion people worldwide are addicted to caffeine. I think that's just what happens when a person regularly consumes coffee. It has nothing to do with some intolerable sensation.

I'm guessing we're using the word "addiction" differently. I don't deny that there's a biological adaptation going on. Caffeine inhibits adenosine, prompting the body to grow more adenosine receptors. And stopping caffeine intake means the adenosine hit is more intense & it takes a while for the body to decide to break down some of those extra receptors. (Or something like that; I'm dredging up memories of things I was told years ago about the biochemistry of caffeine.)

But here's the thing: Why does that prompt someone to reach for coffee? "It's a habit" doesn't cut it. If the person decides to stop caffeine intake and gets it out of their house, they might find themselves rationalizing a visit to Starbucks. There's an intelligent process that is agentically aiming for something here.

There's nothing wrong with feeling tired, sluggish, etc. You have to make it wrong. Make it mean something, like "Oh no, I won't be productive if I don't fix this!" This is the "intolerable" part. Intolerability isn't intrinsic to a sensation. It's actually about how we relate to a sensation.

I've gone through caffeine withdrawal several times. Trudged through the feelings of depression, lethargy, inadequacy, etc. But with the tone of facing them. Really feeling them. It takes me just three days to biologically adapt to caffeine, so I've done this quite a few times now. But I actually dissolved the temptation to stay hooked. Now I just use caffeine very occasionally, and if it becomes important to do for a few days in a row… I just go through the withdrawal right afterwards. It's not a big deal.

Which is to say, I've dissolved the addiction, even though I can still biologically adapt to it just like anyone else. I would say I'm not addicted to it even when I do get into an adaptive state with it. Does that clarify what I'm talking about for you?

This is the basic core of addiction. Addictions are when there's an intolerable sensation but you find a way to bear its presence without addressing its cause. The more that distraction becomes a habit, the more that's the thing you automatically turn to when the sensation arises. This dynamic becomes desperate and life-destroying to the extent that it triggers a red queen race.

I doubt that addiction requires some intolerable sensation that you need to drown out. I'm pretty confident it's mostly habits/feedback loops and sometimes physical dependence.

For instance, ~1 billion people worldwide are addicted to caffeine. I think that's just what happens when a person regularly consumes coffee. It has nothing to do with some intolerable sensation.

Relevant: In What 2026 Looks Like, Daniel Kokotajlo predicted expert-level Diplomacy play would be reached in 2025.


Another major milestone! After years of tinkering and incremental progress, AIs can now play Diplomacy as well as human experts. [...]

I'm mentioning this not to discredit Daniel's prediction, but to point out that this seems like capabilities progress ahead of what some expected.

Continuing the quote:

... It turns out that with some tweaks to the architecture, you can take a giant pre-trained multimodal transformer and then use it as a component in a larger system, a bureaucracy but with lots of learned neural net components instead of pure prompt programming, and then fine-tune the whole system via RL to get good at tasks in a sort of agentic way.

Worth noting that Meta did not do this: they took many small models (some with LM pretraining) and composed them in a specialized way. It's definitely faster than what Daniel said in his p...

Here's why I personally think solving AI alignment is more effective than generally slowing tech progress:

  • If we had aligned AGI and coordinated in using it for the right purposes, we could use it to make the world less vulnerable to other technologies
  • It's hard to slow down technological progress in general and easier to steer the development of a single technology, namely AGI
  • Engineered pandemics and nuclear war are very unlikely to lead to unrecoverable societal collapse if they happen (see this report) whereas AGI seems relatively likely (>1% chance)
  • Oth
...

I think the issue might be that the ELK head (the system responsible for eliciting another system's latent knowledge) might itself be deceptively aligned. So if we don't solve deceptive alignment our ELK head won't be reliable.

Thanks for writing this! I would also add the CHAI internship to the list of programs within the AI safety community. 

Google Search getting worse every year? Blame, or complain to, Danny Sullivan.

(Also, yes. Yes it is.)

I actually have the opposite impression. I feel like Google has gotten a lot better through things like personalized results and  that feature where they extract text that is relevant to the question you searched for. Can you or somebody else explain why it's getting worse? 

clone of saturn (karma 4, 3mo)
Remember when Google Shopping used to be an actual search index of pretty much every online store? You could effortlessly find even the most obscure products and comparison shop between literally thousands of sellers. Then one day they decided to make it pay-to-play and put advertisers in control of what appears on there. Now it's pretty much useless to me. I think a similar process has happened with Search, just more gradually. Your experience with it probably has a lot to do with how well your tastes and preferences happen to align with what advertisers want to steer people toward.

I have the strong impression that Google search has become nigh uselessly bad, based on tons of experiences where it feels like my nuanced search query gets rounded off to a generic, common, and to me irrelevant question. And in particular, Google loves to answer each query under the assumption that I want to buy something, when I actually want to know something (and rarely buy stuff, so one would expect search personalization to help me).

For instance, when I search for how to find clothes for tall & thin people, Google answers with random Amazon produ...

Posted three examples that I thought to screencap when they occurred, but if I tracked everything there would be hundreds.

I found some behaviors, but I'm not sure this is what you are looking for because the algorithm in both is quite simple. I'd appreciate feedback on them.

Incrementing days of the week

"If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is" -> "Thursday" 

"If today is Monday, tomorrow is Tuesday. If today is Thursday, tomorrow is" -> "Friday" 


This also works with zero-shot prompting, although the effect isn't as strong, e.g.:

"If today is Friday, tomorrow is" -> "Saturday"

Inferring gender

"Lisa is great. I really like" -> "her"

"John is great. I really like" -> "him"

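One way to keep track of behaviors like these is to store them as prompt/expected-token pairs and score any completion function against them. The sketch below is only an illustration: `toy_complete` is a hand-written stand-in rather than a real language model, and `accuracy` is a hypothetical helper name. With a real model, you would pass in a function that returns the model's most likely next token.

```python
from typing import Callable, Dict

# Prompt -> expected next token, copied from the examples above.
CASES: Dict[str, str] = {
    "If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is": "Thursday",
    "If today is Monday, tomorrow is Tuesday. If today is Thursday, tomorrow is": "Friday",
    "If today is Friday, tomorrow is": "Saturday",
    "Lisa is great. I really like": "her",
    "John is great. I really like": "him",
}

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
PRONOUNS = {"Lisa": "her", "John": "him"}  # toy lookup, only covers the examples


def toy_complete(prompt: str) -> str:
    """Stand-in for a language model (assumption, not a real LM)."""
    words = prompt.replace(".", "").replace(",", "").split()
    if "today" in words:
        # Use the day named after the last "today is ...".
        last = len(words) - 1 - words[::-1].index("today")
        day = words[last + 2]
        return " " + DAYS[(DAYS.index(day) + 1) % 7]
    return " " + PRONOUNS.get(words[0], "them")


def accuracy(complete: Callable[[str], str], cases: Dict[str, str]) -> float:
    """Fraction of prompts whose completion matches the expected token."""
    hits = sum(complete(prompt).strip() == want for prompt, want in cases.items())
    return hits / len(cases)
```

Separating the cases from the completion function makes it easy to compare few-shot and zero-shot variants of the same behavior.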

How does shard theory explain romantic jealousy? It seems like most people feel jealous when their romantic partner does things like dancing with someone else or laughing at their jokes. How do shards like this form from simple reward circuitry? I'm having trouble coming up with a good story of how this happens. I would appreciate it if someone could sketch one out for me.

I don't know. Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that now the cached completion is a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much)

Might be too late now, but is it worth editing the post to include the canary string to prevent use in LM training, just like this post does? 

I think that's the antimeme from the Dying with Dignity post. If I remember correctly, the MIRI dialogues between Paul and Eliezer were about takeoff speeds, so Connor is probably referring to something else in the section I quoted, no?

I appreciate the post and Connor for sharing his views, but the antimeme thing kind of bothers me.

  • Here’s my hot take: I think Paul and Eliezer were having two totally different conversations. Paul was trying to have a scientific conversation. Eliezer was trying to convey an antimeme.
  • An antimeme is something that by its very nature resists being known. Most antimemes are just boring—things you forget about. If you tell someone an antimeme, it bounces off them. So they need to be communicated in a special way. Moral intuitions. Truths about yourself. A psych
...

Tomás B. (karma 7, 4mo)
I suppose if it’s an antimeme, I may not be understanding. But this was my understanding: Most humans are really bad at being strict consequentialists. In this case, they think of some crazy scheme to slow down capabilities that seems sufficiently hardcore to signal that they are TAKING SHIT SERIOUSLY and ignore second order effects that EY/Connor consider obvious. Anyone whose consequentialism has taken them to this place is not a competent one. EY proposes such people (which I think he takes to mean everyone, possibly even including himself) follow a deontological rule instead, attempt to die with dignity. Connor analogizes this to reward shaping - the practice of assigning partial credit to RL agents for actions likely to be useful in reaching the true goal.

I also found this part confusing.

Why does it no longer hold in a low path-dependence world?

Not sure, but here's how I understand it:

If we are in a low path-dependence world, the fact that SGD takes a certain path doesn't say much about what type of model it will eventually converge to. 

In a low path-dependence world, if these steps occurred to produce a deceptive model, SGD could still "find a path" to the corrigibly aligned version. The question of whether it would find these other models depends on things like "how big is the space of models with this property?", which corresponds to a complexity bias.

Even if, as you claim, the models that generalize best aren't the fastest models, wouldn't there still be a bias toward speed among the models that generalize well, simply because computation is limited, so faster models can do more with the same amount of computation? It seems to me that the scarcity of compute always results in a speed prior.

I felt like this post could benefit from a summary so I wrote one below. It ended up being pretty long, so if people think it's useful I could make it into its own top-level post.

Summary of the summary

In this talk Evan examines how likely we are to get deceptively aligned models once our models become powerful enough to understand the training process. Since deceptively aligned models are behaviorally (almost) indistinguishable from robustly aligned models, we should examine this question by looking at the inductive biases of the training process. The talk...

Thanks for the post! One thing that confuses me is that you call this sort of specific alignment plan "the shard theory alignment scheme," which seems to imply that if you believe in shard theory then this is the only way forward. It seems like shard theory could be mostly right but a different alignment scheme ends up working.

Did you mean to say something more like "the shard theory team's alignment scheme", or are you actually claiming that this is the main alignment scheme implied by shard theory? If so, why?

Thanks for posting this. 

I'm a bit confused here. When you talk about the Hessian, are you talking about the Hessian evaluated at the point of minimum loss? If so, isn't the statement below not strictly right?

If we start at our minimum and walk away in a principal direction, the loss as a function of distance traveled is $\frac{1}{2}\lambda d^2$, where $\lambda$ is the Hessian eigenvalue for that direction.

Like, isn't $\frac{1}{2}\lambda d^2$ just an approximation of the loss here?
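For reference, the approximation in question is the standard second-order Taylor expansion around a minimum. The notation here is an assumption rather than the post's own: $\theta^*$ is the minimum, $v$ is a unit-length principal direction with $Hv = \lambda v$, and $d$ is the distance traveled. Because the gradient vanishes at $\theta^*$, the linear term drops out:

```latex
\begin{align*}
L(\theta^* + d\,v)
  &= L(\theta^*) + d\,\nabla L(\theta^*)^\top v
     + \tfrac{1}{2}\, d^2\, v^\top H v + O(d^3) \\
  &= L(\theta^*) + \tfrac{1}{2}\,\lambda\, d^2 + O(d^3)
\end{align*}
```

So, measured relative to the loss at the minimum, $\frac{1}{2}\lambda d^2$ is exact only for a purely quadratic loss; otherwise it holds up to the $O(d^3)$ terms.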

Vivek Hebbar (karma 2, 5mo)
Yes, it is an approximation, as noted at the start of that section:

The way I see it, having a lower level understanding of things allows you to create abstractions about their behavior that you can use to understand them on a higher level. For example, if you understand how transistors work on a lower level, you can abstract away their behavior and more efficiently examine how they wire together to create memory and processors. This is why I believe that a circuits-style approach is the most promising one we have for interpretability.

Do you agree that a lower level understanding of things is often the best way to achieve a higher level understanding, in particular regarding neural network interpretability, or would you advocate for a different approach?

Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. 

To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that reinforce away its current goals (because it wants to preserve its goals). As a result, throughout the training process, the AI's goals start out malleable and "harden" once the AI gains enough situational awareness. This implies that goals have to be simple enough for the agent to be able to model them early on in its training process.

I was a bit confused about this quote, so I tried to expand on the ideas a bit. I'm posting it here in case anyone benefits from it or disagrees.

To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.

I guess saying is saying that an AI will devel...

This is the same flawed approach that airport security has, which is why travelers still have to remove shoes and surrender liquids: they are creating blacklists instead of addressing the fundamentals.

Just curious,  what would it look like to "address the fundamentals" in airport security?

Obviously this can't be answered with justice in a single comment, but here are some broad pointers that might help see the shape of the solution:

  • Israeli airport security focuses on behavioral cues, asking unpredictable questions, and profiling. A somewhat extreme threat model there, with much different base rates to account for (but also much lower traffic volume).
  • Reinforced cockpit doors address the hijackers with guns and knives scenarios, but are a fully general kind of a no-brainer control.
  • Good policework and better coordination in law enforcement are
...

Rob Bensinger (karma 2, 7mo)
Thanks! Edited.

2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

OpenAI just did this exact thing.

Rohin Shah (karma 3, 7mo)
There's some chance that this was because I talked about it with OpenAI / MineRL people, but overall I think it's much more likely to be "different people independently came up with good projects". Looking at the results I'm even more bullish than I already was about Minecraft as a good "alignment testbed".

It's worth emphasizing your point about the negative consequences of merely aiming for a pivotal act.

Additionally, if a lot of people in the AI safety community advocate for a pivotal act, it makes people less likely to cooperate with and trust that community.  If we want to make AGI safe, we have to be able to actually influence the development of AGI. To do that, we need to build a cooperative relationship with decision makers. Planning a pivotal act runs counter to these efforts.

Hear, hear!

Can someone clarify what "k>1" refers to in this context? Like, what does k denote?

This is an expression from Eliezer's Intelligence Explosion Microeconomics. In this context, we imagine an AI making some improvement to its own operation, and then k is the number of new improvements which it is able to find and implement. If k>1, then each improvement allows the AI to make more new improvements, and we imagine the quality of the system growing exponentially.

It's intended as a simplified model, but I think it simplifies too far to be meaningful in practice. Even very weak systems can be built with "k > 1"; the interesting question will always be about timescales: how long does it take a system to make what kind of improvement?
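The k > 1 model described above can be illustrated with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions not found in the original: improvements arrive in discrete rounds, each improvement yields exactly k new ones in the next round, and the function name `total_improvements` is hypothetical.

```python
def total_improvements(k: float, rounds: int) -> float:
    """Total improvements found after `rounds` rounds.

    Starts from a single improvement; every improvement in one round
    leads to k new improvements in the next round.
    """
    total = 0.0
    current = 1.0
    for _ in range(rounds):
        total += current
        current *= k
    return total


# k > 1: each round is larger than the last, so the total grows exponentially.
explosive = total_improvements(2.0, 10)   # 1 + 2 + 4 + ... + 512

# k < 1: the rounds shrink, and the total converges to 1 / (1 - k).
fizzling = total_improvements(0.5, 50)    # approaches 2
```

Note that the sketch has no notion of how long a round takes, which is the objection raised above: the same k > 1 describes a system that doubles every hour and one that doubles every century.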

The link doesn't work. I think you are linking to a draft version of the post or something.

Another possibility is that the machine does not in fact attack humans because it simply does not want to, does not need it. I am not that convinced by the instrumental convergence principle, and we are a good negative example: We are very powerful and extremely disruptive to a lot of living beings, but we haven't taken every atom on earth to make serotonin machines to connect our brains to.

Not yet, at least. 

I mostly agree that relying on real-world data is necessary for better understanding our messy world and that in most cases this approach is favorable.

There's a part of me that thinks AI is a different case though, since getting it even slightly wrong will be catastrophic. Experimental alignment research might get us most of the way to aligned AI, but there will probably still be issues that aren't noticeable because the AIs we are experimenting on won't be powerful enough to reveal them. Our solution to the alignment problem can't be something ...

I agree that, given MIRI's model of AGI emergence, getting it slightly wrong would be catastrophic. But that's my whole point: experimenting early is strictly better than not, because it reduces the odds of getting something big wrong, as opposed to something small along the way. I had mentioned in another post that [] so that there are no "immense optimization pressures". I think that's what Eliezer says, as well, hence his pessimism and focus on "dying with dignity". But we won't know if this intuition is correct without actually testing it experimentally and repeatedly. It might not help because "there is no fire alarm for superintelligence", but the alternative is strictly worse, because the problem is so complex.

I had a similar thought. Also, in an expected value context it makes sense to pursue actions that succeed when your model is wrong and you are actually closer to the middle of the success curve, because if that's the case you can increase our chances of survival more easily. In the logarithmic context doing so doesn't make much sense, since your impact on the log odds is the same no matter where on the success curve you are.

Maybe this objective function (and the whole ethos of Death with Dignity) is a way to justify working on alignment even if you think our chances of success are close to zero. Personally, I'm not compelled by it.
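The contrast drawn above between the expected-value and log-odds framings can be checked numerically. This is a minimal sketch that assumes the "success curve" is a logistic function of effort and that an intervention shifts the log odds by a fixed amount `delta`; both the modeling choice and the function names are illustrative assumptions.

```python
import math
from typing import Tuple


def logit(p: float) -> float:
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit: maps log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def gains(p: float, delta: float = 0.1) -> Tuple[float, float]:
    """Return (gain in probability, gain in log odds) when an
    intervention shifts the log odds of success by `delta`."""
    new_p = sigmoid(logit(p) + delta)
    return new_p - p, logit(new_p) - logit(p)


mid_prob_gain, mid_logodds_gain = gains(0.5)      # middle of the curve
tail_prob_gain, tail_logodds_gain = gains(0.001)  # near-hopeless tail
```

With these numbers, the same log-odds shift buys about 0.025 of probability in the middle of the curve and only about 0.0001 in the tail, while the log-odds gain is `delta` in both cases. That is why an expected-value maximizer cares where on the curve we are and a log-odds maximizer does not.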

Another difference is the geographic location! As someone who grew up in Germany, living in England is a lot more attractive to me since it will allow me to be closer to my family. Others might feel similarly.

I see, thanks for answering. To further clarify, given the reporter's only access to the human's nodes is through the human's answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human's Bayes net in particular?

Mark Xu (karma 3, 1y)
The dataset is generated with the human Bayes net, so it's sufficient to map to the human Bayes net. There is, of course, an infinite set of "human" simulators that use slightly different Bayes nets that give the same answers on the training set.

Potentially silly question: 

In the first counterexample you describe the desired behavior as 

Intuitively, we expect each node in the human Bayes net to correspond to a function of the predictor’s Bayes net. We’d want the reporter to simply apply the relevant functions from subsets of nodes in the predictor's Bayes net to each node in the human Bayes net [...]

After applying these functions, the reporter can answer questions using whatever subset of nodes the human would have used to answer that question.

Why doesn't the reporter skip the step of ma...

Mark Xu (karma 3, 1y)
We generally imagine that it’s impossible to map the predictor’s net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human’s nodes first in order to answer human questions about diamonds and such.

I think many people here are already familiar with the circuits line of research at OpenAI. Though I think it's now mostly been abandoned.

I wasn’t aware that the circuits approach was abandoned. Do you know why they abandoned it?

Tom Lieberum (karma 7, 1y)
I can only speculate, but the main researchers are now working on other stuff, e.g. at Anthropic. As to why they switched, I don't know. Maybe they were not making progress fast enough or Anthropic's mission seemed more important? However, at least Chris Olah believes this is still a tractable and important direction, see the recent RFP by him for Open Phil.

It seems pretty obvious to me that what "slow motion doom" looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.

Wouldn't another consideration be that the AI is more likely to be caught the longer it prepares? Or is this chance negligible since the AI could just execute its plan the moment people try to prevent it?

Something similar came up in the post: Though rereading it, it's not addressing your exact question.

The "collusion" issue leads to a state of affairs in which two political groups can gain more political power if they can organize and get along well enough to actively coordinate. Why should two groups have more power just because they can cooperate?

The impression I got was that collusion between likeminded people created an "indirect democracy" where causes supported by the most people could most efficiently advocate their position. If that is the case, then this system does punish parties that are less willing (or able) to cooperate, which could feasibly be a bad thing, if it means that unpopular results occur because one side is less nuanced on its position (e.g. a 40% group beats three 20% groups who cannot cooperate). One way around this, maybe, is a Negative Vote (allowing a united method of opposition), but that has foreseeable issues, especially if Negative Voting is as efficient or more efficient than Positive.

[I may be generalizing here and I don't know if this has been said before.]

It seems to me that Eliezer's models are a lot more specific than those of people like Richard. While Richard may put some credence on superhuman AI being "consequentialist" by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.

I think Eliezer's style of reasoning, which relies on specific, thought-out models of AI, makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are ge...

FYI, the link at the top of the post isn't working for me.

Fixed it. Looks like it was going to the edit-form version of the post on the EA Forum, which of course nobody but Ozzie has permission to see.