Jonas Hallgren — LessWrong

AI Safety person currently working on multi-agent coordination problems.

Hot take:

I've been experiencing more "frame lock-in" with more sophisticated models. I recently experienced it with Sonnet 4.5, so I want to share a prediction: models will grow more "intelligent" (capable of reasoning within frames) while having a harder time changing frames. There's research on how more intelligent people become better at motivated reasoning and at interpreting things within existing frames, and it seems like LLMs might exhibit similar biases.

I'm a big heuristics bridging fan as I think that it is to some extent a way to describe a very compressed action-policy based on an existing reward function that has been tested in the past.

So we can think about what you're saying here as a way to learn values to some extent or another. By bridging local heuristics we can find better meta heuristics and also look at what times these heuristics would be optimal. This is why I really like Meaning Alignment Institute's work on this because they have a way of doing this at scale: https://arxiv.org/pdf/2404.10636

I also think that a part of the "third wave" of AI Safety which is more focused on sociotechnical stuff kind of gets around the totalitarian and control heuristics as it's saying it can be solved in a pro-social way? I really enjoyed this post, thanks for writing it!

Firstly, I find it really funny to ask for more specification through an example about something being underspecified and maybe that was the point! :D

If it was not a gag then here's an example based on my interpretation of what #2 is (and I'm happy to be corrected): Imagine that you know that you need to get something done, say you have a deadline on Friday and you need to write an essay on a topic like the economics of AI. Yet you don't know where to start: who's the audience, what type of frame should you take, what example do you start with?

The uncertainty of the task makes you want to avoid it since you need to pin it down first; it is an ambiguous task.

I think thinking as a self-reflective process can be quite limited. It is at a certain level of coarse graining that is higher (at least for me) than doing something like feeling or pre-cognitive intuitions and tendencies.

So, I'll say the boring thing which is basically meditation could be that cogtech as it allows you to increase the precision of your self-reflective microscope and allows you to see other things than the higher coarse graining of self-reflective thought allows you to see. Now, I'm sure that one still falls for a bunch of failure modes there as well since it can be very hard to see what is wrong with a system from within the system itself. It's just that the mistakes become less coarse grained and that they come from another perspective.

In my own experience there are different states of being, one is from the thinking perspective, another is from a perspective of non-thinking awareness. The thinking perspective thinks it's quite smart and takes things very seriously and the aware perspective sees this and thinks it's quite endearing and the thinking part then takes that in and reflects on that it's ironically ignorant. The thinking part tracks externalities and through the aware part is able to drop it because it finds itself ignorant? I used to only have the thinking part and that created lots of loops and cognitive strain and suffering because I got stuck in certain beliefs?

I think this deep belief of knowing that I'm very cognitively limited in terms of my perspective and frame allows me to hold beliefs about the world and myself a lot more loosely than I was able to hold them before? Life is a lot more vibrant and relaxing as a consequence, as it is a lot easier to be wrong and it is actually a delight to be proven wrong. I would say this in the past but I wouldn't emotionally feel it. As I heard someone say: "Meditation is the practice of taking what you think into what you feel".

I wanted to ask if you could record it or at least post the transcript after it's done? It would be nice to have. Also, this was cool as I got to understand the ideas more deeply and from a different perspective than Sahil's. I thought it was quite useful, especially in how it relates to agency.

Prediction & Warning:

There are lots of people online who have started to pick up the word "clanker" in order to protest against AI systems. This word and sentiment are on the rise and I think that this will be a future schism in the more general anti-AI movement. The warning part here is that I think that the Pause movement and similar can likely get caught up in a general anti-AI-system speciesism.

Given that we're starting to see more and more agentic AI systems with more continuous memory as well as more sophisticated self-modelling, the basic foundations for a lot of the existing physicalist theories of consciousness are starting to be fulfilled. Within 3-5 years I find it quite likely that AIs will at least have some sort of basic sentience that we can almost basically prove (given IIT or GNW or another physicalist theory).

This could potentially be one of the largest suffering risks that we've seen that we're potentially inducing on the world. When you're using a word like "clanker", you're essentially demonizing that sort of a system. Right now it's generally fine as it's currently about a sycophantic non-agentic chatbot and so it's fine as an anti measure to some of the existing thoughts of AIs being conscious but it is likely a slippery slope?

More generally, I've seen a bunch of generally kind and smart AI Safety people have quite an anti-AI species sentiment in terms of how to treat these sorts of systems. From my perspective, it feels a bit like it comes from a place of fear and distrust which is completely understandable as we might die if anyone builds a superintelligent AI.

Yet that fear of death shouldn't stop us from treating potential conscious beings kindly?

A lot of racism or similar can be seen as coming from a place of fear: the Aryan master race was promoted because of the idea that humanity would go extinct if we got worse genetics into the system. What's the difference from the idea that AIs might share our future lightcone?

The general argument goes that this time it is completely different since the AI can self-replicate, edit its own software, etc. This is a completely reasonable argument as there are a lot of risks involved with AI systems.

It is when we get to the next part that I see a problem. The argument that follows is: "Therefore, we need to keep the almighty humans in control to wisely guide the future of the lightcone."

Yet, there's generally a lot more variance within a distribution of humans compared to variance between distributions.

So when someone says that we need humans to remain in control, I think: "mmm, yes, the totally homogeneous group of 'humans' that doesn't include people like Hitler, Pol Pot and Stalin". And for the AI side of things we also have the same: "Mmm, yes, the totally homogeneous group of 'all possible AI systems' that should be kept away so that the 'wise humans' can remain in control." Because a malignant RSI system is the only future AI based system that can be thought of, there is no way to change the system so that it values cooperation and there is no other way for future AI development to go than a quick take-off where an evil AI takes over the world.

Yes, there are obviously things that AIs can do that humans can't, but don't demonize all possible AI systems as a consequence; it is not black and white. We can protect ourselves against recursively self-improving AI and at the same time respect AI sentience. We can hold statements that are contradictory at the surface level at the same time?

So let's be very specific about our beliefs and let's make sure that our fear does not guide us into a moral catastrophe, whether it be the extinction of all future life on earth or a capture of sentient beings into a future of slavery?

I wanted to register some predictions and bring this up as I haven't seen that many discussions on it. Politics is war and arguments are soldiers, so let's keep it focused on the object level? If you disagree, please tell me the underlying reasons. Finally, in that spirit, here's a set of questions I would want to ask someone who's against the sentiment expressed above:

How do we deal with potentially sentient AI?
Does respecting AI sentience lead to powerful AI taking over? Why?
What is the story that you see towards that? What are the second and third-order consequences?
What do you imagine our society looking like in the future?
How does a human controlled world look in the future?

I would change my mind if you could argue that there is a better heuristic to use than kindness and respect towards other sentient beings. You need to play tit for tat with defecting agents, yet why are all AI systems defecting in that case? Why is the cognitive architecture of future AI systems so different that I can't apply the same game theoretical virtue ethics on them as I do to humans? And given the inevitable power-imbalance arguments that I'll get as a consequence of that question, why don't we just aim for a world where we retain power balance between our top-level and bottom-up systems (a nation and an individual for example) in order to retain power-balance between actors?

Essentially, I'm asking for a reason to believe why this story of system level alignment between a group and an individual will be solved by not including future AI systems as part of the moral circle?
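To make the tit-for-tat point above a bit more concrete, here is a minimal sketch of an iterated prisoner's dilemma. The payoff numbers, the strategy names and the `play` helper are all illustrative assumptions and not something from the discussion itself. The only thing it is meant to show is that tit for tat retaliates against a defector while still cooperating fully with anyone who cooperates back, so being kind by default does not require assuming that the other agent is safe.

```python
# Payoff to "me" for (my_move, their_move); "C" = cooperate, "D" = defect.
# These are commonly used illustrative values, assumed here for the sketch.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """Defect no matter what the opponent did."""
    return "D"


def always_cooperate(opponent_history):
    """Cooperate no matter what the opponent did."""
    return "C"


def play(strategy_a, strategy_b, rounds=10):
    """Run an iterated game and return the total scores (score_a, score_b)."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy only sees the other player's past moves.
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print("tit for tat vs defector:     ", play(tit_for_tat, always_defect))
    print("tit for tat vs tit for tat:  ", play(tit_for_tat, tit_for_tat))
    print("pure cooperator vs defector: ", play(always_cooperate, always_defect))
```

Under these assumed payoffs, tit for tat loses only the first round to a pure defector and then matches it, whereas unconditional cooperation gets exploited every round. Against another conditional cooperator it never defects at all.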

Thank you for clarifying, I think I understand now!

I notice I was not that clear when writing my comment yesterday so I want to apologise for that.

I'll give an attempt at restating what you said in other terms. There's a concept of temporal depth in action plans. The question is, to some extent, how many steps into the future you are looking compared to something else. A simple way of imagining this is how far into the future a chess bot can plan, and how Stockfish is able to plan basically 20-40 moves in advance.

It seems similar to what you're talking about here in that the longer someone plans in the future, the more external attempts it avoids with regards to external actions.

Some other words to describe the general vibe might be planned vs unplanned or maybe centralized versus decentralized? Maybe controlled versus uncontrolled? I get the vibe better now though so thanks!

I guess I'm a bit confused why the emergent dynamics and the power-seeking are on different ends of the spectrum?

Like what do you even mean by emergent dynamics there? Are we talking about non-power-seeking systems, and in that case, what systems are non-power-seeking?

I would claim that there is no system that is not power-seeking, since any system that survives needs to do Bayesian inference and therefore needs to minimize free energy (self-referencing here but whatever). Hence any surviving system needs to power-seek, given that power-seeking is attaining more causal control over the future.

So therefore, there is no future where there is no power-seeking system; it is just that the thing that power-seeks acts over larger timespans and is more of a slow actor. The agentic attractor space is just not human flesh bag space nor traditional space, it is different yet still a power seeker.

Still, I do like what you say about the change in the dynamics and how power-seeking is maybe more about a shorter temporal scale? It feels like the y-axis should be that temporal axis instead since it seems to be more what you're actually pointing at?

I was reflecting on some of the takes here for a bit and if I imagine a blind gradient descent in this direction, I imagine quite a lot of potential reality distortion fields due to various underlying dynamics involved with holding this position.

So the one thing I wanted to ask was whether you have any sort of reset mechanism here? Like what is the Schelling point before the slippery slope? What is the specific action pattern you would take if you got too far? Or do you trust future you enough in order to ensure that it won't happen?

I just want to be annoying and drop a "hey, don't judge a book by its cover!"

There might be deeper modelling concerns that we've got no clue about. It's weird and is a negative signal, but it is often very hard to see second order consequences and similar from a distance!

(I literally know nothing about this situation but I just want to point it out)
