Thane Ruthenis



deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it

Aha, that's the difficulty I was overlooking. Specifically, I didn't consider that the approach under consideration here requires us to formally define how we're filtering them out.


What evidence is there of this?

Nothing decisive one way or another, of course.

  • There's been some success in locating abstract concepts in LLMs, and it's generally clear that their reasoning is mainly operating over "shallow" patterns. They don't keep track of precise details of scenes. They're thinking about e. g. narrative tropes, not low-level details.
    • Granted, that's the abstraction level at which simulacra themselves are modeled, not distributions-of-simulacra. But that already suggests that LLMs are "efficient" simulators, and if so, why would higher-level reasoning be implemented using a different mechanism?
  • Think about how you reason, and which ways of doing so are more or less efficient. Take figuring out how to convince someone of something: a detailed, immersive step-by-step simulation isn't it; babble-and-prune isn't it. You start at a highly abstract level, then drill down, making active choices all the way with regards to what pieces need more or less optimizing.
  • Abstract considerations with regards to computational efficiency. The above just seems like a much more efficient way to run "simulations" than the brute-force way.

This just seems like a better mechanical way to think about it. Same way we decided to think of LLMs as "simulators", I guess.

Isn't physics a counterexample to this?

No? Physics is a dumb simulation just hitting "next step", which has no idea about the higher-level abstract patterns that emerge from its simple rules. It's wasteful, it's not operating under resource constraints to predict its next step most efficiently, it's not trying to predict a specific scenario, etc.
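A toy sketch of that contrast, offered only as an illustration and not as a claim about how LLMs (or physics) actually compute anything. The falling-object scenario, the function names, and the step size are all assumptions made up for this example. The first function "hits next step" until the object lands; the second exploits a known higher-level regularity and gets the same answer in a single evaluation.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (illustrative constant)


def fall_time_stepwise(height_m, dt=1e-4):
    """'Dumb' simulator: advance velocity and position one small tick at a time."""
    y, v, t, steps = height_m, 0.0, 0.0, 0
    while y > 0.0:
        v += G * dt  # update velocity from acceleration
        y -= v * dt  # update position from velocity
        t += dt
        steps += 1
    return t, steps


def fall_time_abstract(height_m):
    """Higher-level inference: apply the regularity h = g * t^2 / 2 directly."""
    return math.sqrt(2.0 * height_m / G)


if __name__ == "__main__":
    t_sim, n_steps = fall_time_stepwise(20.0)
    t_abs = fall_time_abstract(20.0)
    print(f"stepwise: {t_sim:.4f} s after {n_steps} steps")
    print(f"abstract: {t_abs:.4f} s after 1 evaluation")
```

Both routes agree to within the step size, but the stepwise one pays for thousands of updates that carry no information about the question actually being asked.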

The shoggoth is not a simulacrum, it's the process by which simulacra are chosen and implemented. It's the thing that "decides", when prompted with some text, that what it "wants" to do is to figure out what simulated situation/character that text corresponds to, and which then figures out what will happen in the simulation next, and what it should output to represent what has happened next.

I suspect that, when people hear "a simulation", they imagine a literal step-by-temporal-step low-level simulation of some process, analogous to running a physics engine forward. You have e. g. the Theory of Everything baked into it, you have some initial conditions, and that's it. The physics engine is "dumb", it has no idea about higher-level abstract objects it's simulating, it's just predicting the next step of subatomic interactions and all the complexity you're witnessing is just emergent.

I think that's an error. It's been recently pointed out with regards to acausal trade that the detailed simulations people often imagine for one-off acausal deals are ridiculously expensive, and that abstract, broad-strokes inference is much cheaper. Such "inference" would also be a simulation in some sense, in that it involves reasoning about the relevant actors and processes and modeling them across time. But it's much more efficient, and, more importantly, it's not dumb. It's guided by a generally intelligent process, which is actively optimizing its model for accuracy, jumping across abstraction layers to improve it. It's not just a brute algorithm hitting "next step".

Same with LLMs. They're "simulators", in the sense that they're modeling a situation and all the relevant actors and factors in it. But they're not dumb physics-engine-style simulations, they're highly sophisticated reasoners that make active choices with regards to what aspects they should simulate in more or less detail, where they can get away with higher-level logic only, etc.

That process is reasoning over whole distributions of possible simulacra, pruning and transforming them in a calculated manner. It's necessarily more complicated/capable than them.

That thing is the shoggoth. It's fundamentally a different type of thing than any given simulacrum. It doesn't have a direct interface to you, you're not going to "talk" to it (as the follow-up tweet points out).

So far, LLMs are not AGI, and the shoggoth is not generally intelligent. It's not going to do anything weird, it'd just stick to reasoning over simulacra distributions. But if LLMs or some other Simulator-type model hits AGI, the shoggoth would necessarily hit AGI as well (since it'd need to be at least as smart as the smartest simulacrum it can model), and then whatever heuristics it has would be re-interpreted as goals/values. We'd thus get a misaligned AGI, and by the way it's implemented, it would be in a direct position to "puppet" any simulacrum it role-plays.

Generative world-models are not especially safe; they're as much an inner alignment risk as any other model.

Yeah, but the random babbling isn't solving the problem here, it's used as random seeds to improve your own thought-generator's ability to explore. Like, consider cognition as motion through the mental landscape. Once a motion is made in some direction, human minds' negative creativity means that they're biased towards continuing to move in the same direction. There's a very narrow "cone" of possible directions in which we can proceed from a given point; we can't stop and turn in an arbitrary direction. LLMs' babble, in this case, is meant to increase the width of that cone by adding entropy to our "cognitive aim", letting us make sharper turns.

In this frame, the human is still doing all the work: they're the ones picking the ultimate direction and making the motions, the babble just serves as vague inspiration.

Or maybe all of that is overly abstract nonsense.

The problem is that the AI doesn't a priori know the correct utility function, and whatever process it uses to discover that function is going to be attacked by Mu

I don't understand the issue here. Mu can only interfere with the simulated AI's process of utility-function discovery. If the AI follows the policy of "behave as if I'm outside the simulation", AIs simulated by Mu will, sure, recover tampered utility functions. But AIs instantiated in the non-simulated universe, who deliberately avoid thinking about Mu/who discount simulation hypotheses, should just safely recover the untampered utility function. Mu can't acausally influence you unless you deliberately open a channel to it.

I think I'm missing some part of the picture here. Is it assumed that any process of utility-function discovery has to somehow route through (something like) the unfiltered universal prior? Or that uncertainty with regards to one's utility function means you can't rule out the simulation hypothesis out of the gate, because it might be that what you genuinely care about is the simulators?

Disclaimer: Haven't actually tried this myself yet, naked theorizing.

“We made a wrapper for an LLM so you can use it to babble random ideas!” 

I'd like to offer a steelman of that idea. Humans have negative creativity — it takes conscious effort to come up with novel spins on what you're currently thinking about. An LLM babbling about something vaguely related to your thought process can serve as a source of high-quality noise, noise that is both sufficiently random to spark novel thought processes and relevant enough to prompt novel thoughts on the actual topic you're thinking about (instead of sending you off in a completely random direction). Tools like Loom seem optimized for that.

It's nothing a rubber duck or a human conversation partner can't offer, qualitatively, but it's more stimulating than the former, and is better than the latter in that it doesn't take up another human's time and is always available to babble about what you want.

Not that it'd be a massive boost to productivity, but it might lower the friction costs of engaging in brainstorming, making it less effortful.

... Or it might degrade your ability to think about the subject matter mechanistically, and push you to optimize your ideas in the direction of what merely sounds like it makes sense semantically. Depends on how seriously you'd be taking the babble, perhaps.

Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”

... So maybe we should fund an effort to fine-tune some AI model on a carefully curated dataset of good operationalizations? Not convinced building it would require alignment research expertise specifically, just "good at understanding the philosophy of math" might suffice.

Finding the right operationalization is only partly intuition; partly it's just knowing what sorts of math tools are available, that is, what exists in the concept-space and is already discovered. That part basically requires having a fairly legible high-level mental map of the entire space of mathematics, and building it is very effortful, takes many years, and has very little return on learning any specific piece of math.

At least, it's definitely something I'm bottlenecked on, and IIRC even the Infra-Bayesianism people ended up deriving from scratch a bunch of math that later turned out to be already known as part of imprecise probability theory. So it may be valuable to get some sort of "intelligent applied-math wiki" that babbles possible operationalizations at you/points you towards math-fields that may have the tools for modeling what you're trying to model.

That said, I broadly agree that the whole "accelerate alignment research via AI tools" approach doesn't seem very promising, in either the Cyborgism or the Conditioning Generative Models directions. Not that I see any fundamental reason why pre-AGI AI tools can't be somehow massively helpful for research; on the contrary, it feels like there ought to be some way to loop them in. But it sure seems trickier than it looks at first or second glance.

Hm, there seem to be two ways the statement "human values are a natural abstraction" could be read:

  1. "Human values" are a simple/convergent feature of the concept-space, such that we can expect many alien civilizations to have a representation for them, and for AIs' preferences to easily fall into that basin.
  2. "Human values" in the sense of "what humans value" — i. e., if you're interacting with the human civilization, the process of understanding that civilization and breaking its model into abstractions will likely involve computing a representation for "whatever humans mean when they say 'human values'".

To draw an analogy, suppose we have an object with some Shape X. If "X = a sphere", we can indeed expect most civilizations to have a concept of it. But if "X = the shape of a human", most aliens would never happen to think about that specific shape on their own. However, any alien/AI that's interacting with the human civilization surely would end up storing a mental shorthand for that shape.

I think (1) is false and (2) is... probably mostly true in the ways that matter. Humans don't have hard-coded utility functions, human minds are very messy, so there may be several valid ways to answer the question of "what does this human value?". Worse yet, every individual human's preferences, if considered in detail, are unique, so even once you decide on what you mean by "a given human's values", there are likely different valid ways of agglomerating them. But hopefully the human usage of those terms isn't too inconsistent, and there's a distinct "correct according to humans" way of thinking about human values. Or at least a short list of such ways.

(1) being false would be bad for proposals of the form "figure out value formation and set up the training loop just so in order to e. g. generate an altruism shard inside the AI". But I think (2) being even broadly true would suffice for retarget-the-search–style proposals.

I think it's mostly right, in the sense that any given novel research artifact produced by Visionary A is unlikely to be useful for whatever research is currently pursued by Visionary B. But I think there's a more diffuse speed-up effect from scale, based on the following already happening:

the intuitions that lead one to a solution might be the sort of thing that you can only see if you've been raised with the memes generated by the partial-successes and failures of failed research pathways

The one thing all the different visionaries pushing in different directions do accomplish is mapping out the problem domain. If you're just prompted with the string "ML research is an existential threat", and you know nothing else about the topic, there's a plethora of obvious-at-first-glance lines of inquiry you can go down. Would prosaic alignment somehow not work, and if so, why? How difficult would it be to interpret an ML model's internals? Can we prevent an ML model from becoming an agent? Is there some really easy hack to sidestep the problem? Would intelligence scale so sharply that the first AGI failure kills us all? If all you have to start with is just "ML research is an existential threat", all of these look... maybe not equally plausible, but not like something you can dismiss without at least glancing in that direction. And each glance takes up time.

On the other hand, if you're entering the field late, after other people have looked in these directions already, surveying the problem landscape is as easy as consuming their research artifacts. Maybe you disagree with some of them, but you can at least see the general shape of the thing, and every additional bit of research clarifies that shape even further. Said "clarity" allows you to better evaluate the problem, and even if you end up disagreeing with everyone else's priorities, the clearer the vision, the better you should be able to triangulate your own path.

So every bit of research probabilistically decreases the "distance" between the solution and the point at which a new visionary starts. Orrr, maybe not decreases the distance, but allows a new visionary to plot a path that looks less like a random walk and more like a straight line.
