Discord: LemonUniverse (.lemonuniverse). Reddit: u/Smack-works. About my situation: here.
I wrote some bad posts before 2024 because I was very uncertain about how events might develop.
I do philosophical/conceptual research and have no mathematical or programming skills, but I do know a bunch of mathematical and computer science concepts.
Yes, it could be that "special, inherently more alignable cognition" doesn't exist or can't be discovered by mere mortal humans. It could be that humanlike reasoning isn't inherently more alignable. Finally, it could be that we can't afford to study it because the dominant paradigm is different. Also, I realize that glass box AI is a pipe dream.
Wrt sociopaths/psychopaths: I'm approaching it from a more theoretical standpoint. If I knew a method of building a psychopathic AI (caring about something selfish, e.g. gaining money or fame or social power or new knowledge or even paperclips) and knew the core reasons why it works, I would consider it major progress, because it would solve many alignment subproblems, such as ontology identification and subsystems alignment.
I'm approaching it from a "theoretical" perspective[1], so I want to know how "humanlike reasoning" could be defined (beyond "here's some trusted model which somehow imitates human judgement") or why human-approved capability gain preserves alignment (like, what's the core reason, what makes human judgement good?). So my biggest worry is not that the assumption might be false, but that the assumption is barely understood on the theoretical level.
What are your research interests? Are you interested in defining what "explanation" means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I'm kinda following the pipe dream of glass box AI.
I'm contrasting theoretical and empirical approaches.
Empirical - "this is likely to work, based on evidence". Theoretical - "this should work, based on math / logic / philosophy".
Empirical - "if we can operationalize X for making experiments, we don't need to look for a deeper definition". Theoretical - "we need to look for a deeper definition anyway".
Alignment plans can be split into two types:
Usual plans. AI gains capabilities (X). We figure out how to point X at the alignment target (Y). There's no deep connection between X and Y. One thing is mounted onto the other.
HRLM plans. We give AI a special X, with a deep connection to Y.
HRLM is the idea that there's some special reasoning/learning method which is crucial for alignment or makes it fundamentally easier. HRLM means "humanlike reasoning/learning method" or "special, human-endorsed reasoning/learning method". There's no hard line separating the two types of plans. It's a matter of degree.
I believe HRLM is ~never discussed in full generality and ~never discussed from a theoretical POV. This is a small post where I want to highlight the idea and facilitate discussion, not make a strong case for it.
(My description of other people's work is not endorsed by them.)
Corrigibility. "Corrigible cognition" is a hypothetical, special type of self-reflection () which is extremely well-suited for learning human values/desires ().
In "Don't align agents to evaluations of plans" Alex Turner argues "there's a correct way to reason () about goals () and consequentialist maximization of an 'ideal' function is not it", "'direct cognition' () about goals () is fundamentally better than 'indirect cognition'". Shard Theory, in general, proposes a very special method for learning and thinking about values.
A post about "follow-the-trying game" by Steve Byrnes basically says "AI will become aligned or misaligned at the stage of generating thoughts, so we need to figure out the 'correct' way of generating thoughts (), instead of overfocusing on judging what thoughts are aligned ()". Steve's entire agenda is about HRLM.
Large Language Models. I'm not familiar with the debate, but I would guess it boils down to two possibilities: "understanding human language is a core enough capability () for a LLM, which makes it inherently more alignable to human goals ()" and "LLMs 'understand' human language through some alien tricks which don't make them inherently more alignable". If the former is true, LLMs are an example of HRLM.
Policy Alignment (Abram Demski) is tangentially related, but it's more in the camp of "usual plans".
Notice how, despite multiple agendas falling under HRLM (Shard Theory, brain-like AGI, LLM-focused proposals), there's almost no discussion of HRLM from a theoretical POV. What is, abstractly speaking, "humanlike reasoning"? What are the general principles of it? What are the general arguments for safety guarantees it's supposed to bring about? What are the True Names here? With Shard Theory, there's ~zero explanation of how simpler shards aggregate into more complex shards and how it preserves goals. With brain-like AGI, there's ~zero idea of how to prevent thought generation from bullshitting thought assessment. But those are the very core questions of the agendas. So they barely move us from square one.[1]
There are many possibilities. It could be that any HRLM handicaps the AI's capabilities (a superintelligence is supposed to be unimaginably better at reasoning than humans, so why wouldn't it have an alien reasoning method?). It could also be that HRLM is necessary for general intelligence. But maybe general intelligence is overrated...
Here's what I personally believe right now:
I consider 1-3 to be plausible enough postulates. I have no further arguments for 4.
I have a couple of very unfinished ideas. I'll try to write about them this month or the next.
I believe there could be a special type of cognition which helps to avoid specification gaming and goal misgeneralization. The AI should create simple models which describe the "costs/benefits" of actions (e.g. "actions" can be body movements, "cost" can be the number and complexity of movements, "benefit" can be distance covered); this way the AI can notice when certain actions produce anomalously high benefit (e.g. maybe certain body movements exploit a glitch in the physics simulation, making the body cover kilometers per second).
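A minimal sketch of how such a check could look, under my own toy assumptions (the "simple cost/benefit model" is just a benefit-to-cost ratio, and "anomalous" means far above the typical ratio; the factor of 10 is arbitrary):

```python
from statistics import median

def flag_anomalous_actions(actions, costs, benefits, factor=10.0):
    """Flag actions whose benefit is anomalously high for their cost.

    Toy stand-in for a "simple cost/benefit model": the model is just the
    benefit-to-cost ratio of each action, and an action is suspicious
    (a possible glitch exploit) if its ratio is far above the typical ratio.
    """
    ratios = [b / c for b, c in zip(benefits, costs)]
    typical = median(ratios)
    return [a for a, r in zip(actions, ratios) if r > factor * typical]

# Example: the last "movement" covers absurd distance for a tiny cost.
print(flag_anomalous_actions(
    actions=["walk", "run", "jump", "glitch-clip"],
    costs=[1.0, 2.0, 1.5, 0.5],
    benefits=[1.0, 2.5, 1.2, 5000.0],
))  # -> ['glitch-clip']
```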
"By default, manipulating easier to optimize/comprehend variables is better than manipulating harder to optimize/comprehend variables" — this is the idea from one of my posts. The problem with it is that I only defined "optimization" and "comprehension" for world-models, not for any modelling (= cognition) in general.
A formal algorithm can have parts on which it critically depends (for example, an algorithm for solving equations might have an absolutely necessary addition sub-algorithm). An informal algorithm can have parts without critically depending on those parts (for example, the algorithm answering "is this a picture of a dog?" might have a sub-algorithm answering "is this patch of pixels the focal point of the image / does it contrast enough with other patches / is it as detailed as the other patches?" - the sub-algorithm is not strictly necessary, but it reduces pareidolia by preventing the algorithm from overanalyzing random parts of the image). I think we can say something about how the latter type of algorithm works.
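To make the contrast concrete, here's a toy sketch of an "informal algorithm with a non-critical part"; the class, attribute names, and thresholds are all hypothetical, purely to illustrate the structure:

```python
from dataclasses import dataclass

@dataclass
class Patch:
    contrast: float  # how strongly the patch contrasts with the rest of the image
    detail: float    # how detailed the patch is relative to the rest of the image
    dogness: float   # stand-in score from some learned "dog pattern" matcher

def is_focal_point(patch: Patch) -> bool:
    # Non-critical sub-algorithm: the classifier still works if you delete it,
    # it just overanalyzes random background patches more often ("pareidolia").
    return patch.contrast > 0.5 and patch.detail > 0.5

def looks_like_a_dog(patches: list[Patch]) -> bool:
    # Critical part: accumulate "dogness" evidence over the patches we attend to.
    score = sum(p.dogness for p in patches if is_focal_point(p))
    return score > 1.0

# Compare: in a formal algorithm (e.g. equation solving), deleting the addition
# sub-algorithm breaks everything; here deleting is_focal_point only degrades quality.
print(looks_like_a_dog([Patch(0.9, 0.8, 1.5), Patch(0.1, 0.2, 0.9)]))  # True
```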
IMO that's downstream of inner alignment being extremely hard. It's almost impossible to come up with even a mildly promising solution which explains, in at least some detail, how the hardest part of the problem might get solved. I'm not trying to throw shade. Also, I might just be ignorant about some ideas in those agendas.
(Here's an observation about adjectives, verbs, and language in general. It might be important even if I'm misinterpreting the definition of natural latents.)
For many adjectives, we can define the concept "salience of <insert an adjective>". Salience of color / texture / shape / size / etc.
For example, what's "salience of a texture"? It's a function of how much of the texture is present (in your field of view) and how strongly the texture contrasts with other present textures.
We can learn an empirical rule: "if a texture is salient enough, then it's probably caused by a single object or a single kind of objects".[1] Yet which object or kind it is can be anything. Would this make "being caused by an object/kind X" a natural latent over "pixels with a salient texture Y" and vice versa? An object's texture tends to be similarly salient in many different situations, so a particular value of "salience of a texture" can itself be a natural latent. "Salience of a texture" is not the same thing as "a texture", but it's one of the reasons why textures are important.
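A toy operationalization of "salience of a texture", under my own guessed formula (coverage of the field of view times contrast relative to the other textures present), with the empirical rule as a threshold check:

```python
def texture_salience(coverage: float, contrast: float, other_contrasts: list[float]) -> float:
    """Guessed formula: how much of the field of view the texture covers,
    times how strongly it contrasts with the other textures present."""
    background = max(other_contrasts, default=0.0)
    return coverage * (contrast / (background + 1e-9))

def probably_single_source(coverage, contrast, other_contrasts, threshold=0.5) -> bool:
    # Empirical rule from the text: a salient-enough texture is probably
    # caused by a single object or a single kind of objects.
    return texture_salience(coverage, contrast, other_contrasts) > threshold

# A large, high-contrast "angular" region (mountains) vs. a faint speck of noise.
print(probably_single_source(0.4, 0.9, [0.3, 0.2]))   # True
print(probably_single_source(0.01, 0.3, [0.9, 0.8]))  # False
```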
(Landscape example below.)
Similarly, we can consider "salience of an action", with an empirical rule like "salient actions (e.g. salient movements) are usually caused by a single object / a single kind of objects / a single causal process".[2] Such a rule makes fine-grained classification of actions important, which makes action-related words important.
Beyond "salience of a texture", we can consider concepts like "danger of an object" (is it a good idea to touch or step on?) and "traversability of an object" (can you walk/climb on it?). For many types of objects, those concepts will be independent of most facts about the object other than its texture. Let's call concepts like this (salience, danger, traversability, etc.) "auxiliary".
Maybe this sort of reasoning can explain analogies and connotation, since auxiliary concepts are a type of connotation.
Imagine looking at a nature landscape. You notice a bunch of angular texture (mountains), a bunch of puffy texture (clouds), a bunch of fluffy texture (trees), a bunch of smooth texture (fields). The rule says each type of texture most likely belongs to a single object or kind. Note that this is not trivial - we could live in a world where we often see radically different things with similar, equally salient textures at the same time.
For example, imagine you see a burst of flame and rubble flying around - most likely it's a single causal process, an explosion.
I have a couple of questions/points. They might stem from not understanding the math.
1) The very first example shows that absolutely arbitrary things (e.g. arbitrary green lines) can be "natural latents". Does this mean that "natural latents" don't capture the intuitive idea of "natural abstractions"? That is, all natural abstractions are natural latents, but not all natural latents are natural abstractions? You seem to be confirming this interpretation, but I just want to be sure:
So we view natural latents as a foundational tool. The plan is to construct more expressive structures out of them, rich enough to capture the type signatures of the kinds of concepts humans use day-to-day, and then use the guarantees provided by natural latents to make similar isomorphism claims about those more-complicated structures. That would give a potential foundation for crossing the gap between the internal ontologies of two quite different minds.
Is there any writing about what those "more expressive structures" could be?
2) Natural latents can describe both things which propagate through very universal, local physical laws (e.g. heat) and any commonalities in made-up categories (e.g. "cakes"). Natural latents seem very interesting in the former case, but I'm not sure about the latter, and I'm not sure the analogy between the two gives any insight. I'm still not seeing any substantial similarity between cakes and heat or Ising models. I.e. I see that an analogy can be made, but I don't feel that this analogy is "grounded" in important properties of reality (locality, universality, low-levelness, stability, etc.). Does this make sense?
3) I don't understand what "those properties can in-principle be well estimated by intensive study of just one or a few mai tais" (from here) means. To me a natural latent is something like ~"all words present in all of 100 books"; that's impossible to know unless you read every single book.
If I haven't missed anything major, I'd say core insights about abstractions are still missing.
EDIT 17/07: I did miss at least one major thing. I haven't understood the independence condition. If you take all words present in all 100 books, it doesn't guarantee that those words make the books or their properties independent.
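For concreteness, here's the toy latent I have in mind, "all words present in all of the books". It's easy to compute, but, per the edit above, computing it says nothing about whether it satisfies the independence condition:

```python
def shared_words(books: list[str]) -> set[str]:
    """The candidate latent: the words that appear in every book."""
    word_sets = [set(book.lower().split()) for book in books]
    return set.intersection(*word_sets) if word_sets else set()

books = ["the dog ran home", "the cat ran away", "the bird ran far and wide"]
print(shared_words(books))  # {'the', 'ran'}
# Note: conditioning on this shared vocabulary does not automatically make
# the books (or their other properties) approximately independent.
```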
Have an idea about interpretability and defining search/optimization.
Finite algorithms can solve infinite (classes of) problems. For example, the algorithm for adding two numbers has a finite description, yet can solve an infinity of examples.
This is a basic truth of computability theory.
Intuitively, it means that algorithms can exploit regularities in problems. But "regularity" here can only be defined tautologically (any smaller thing which solves/defines a bigger thing).
Many algorithms have the following property:
Intuitively, it means that algorithms can exploit regularities in problems. However, here "regularity" has a stricter definition than in the trivial property (TP). A "regularity" is an easily computable thing which gives you important, easily computable information about a hard-to-compute thing.
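A toy contrast between the two properties, with my own examples (not ones from any post): the addition function illustrates TP, and the divisor check illustrates LTP, where a cheap regularity (the last bit of n) often settles a question that would otherwise require a search.

```python
# Trivial property (TP): a finite description handles infinitely many inputs.
def add(a: int, b: int) -> int:
    return a + b

# Less trivial property (LTP): exploit an easily computable "regularity" that
# carries important information about a hard-to-compute answer.
def has_nontrivial_divisor(n: int) -> bool:
    # Cheap regularity: for n > 2, the last bit alone tells us n is divisible by 2.
    if n > 2 and n % 2 == 0:
        return True
    # Hard-to-compute fallback: trial division up to sqrt(n).
    d = 3
    while d * d <= n:
        if n % d == 0:
            return True
        d += 2
    return False

print(add(2, 3), has_nontrivial_divisor(10**12 + 2), has_nontrivial_divisor(13))
# -> 5 True False
```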
Now, the question is: what classes of algorithms have the less trivial property (LTP)?
LTP includes a bunch of undefined terms:
Probably giving fully general definitions right away is not important. We could try starting with overly restrictive definitions and see if we can prove anything about those.
If neural networks implement algorithms with LTP, we could try finding those algorithms by looking for the regularities they exploit (which is much easier than looking for the algorithms themselves).
Furthermore, LTP seems very relevant to defining search / optimization.
Examples of algorithms with LTP:
Got around to interrogating Gemini for a bit.
Seems like KSF talks about programs generating sets. It doesn't say anything about the internal structure of the programs (but that's where objects such as "real diamonds" live). So let's say the data is a very long video of dogs doing various things. If I apply KSF, I get programs (aka "codes") generating sets of videos. But it doesn't help me identify "the most dog-like thing" inside each program. For example, one of the programs might be an atomic model of physics, where "the most dog-like things" are stable clouds of atoms. But KSF doesn't help me find those clouds. A similarity metric between videos doesn't help either.
My conceptual solution to the above problem, proposed in the post: if you have a simple program with a special internal structure describing simple statistical properties of "dog-shaped pixels" (such a program is guaranteed to exist), there also exists a program with a very similar internal structure describing "valuable physical objects causing dog-shaped pixels" (if such a program doesn't exist, then "valuable physical objects causing dog-shaped pixels" don't exist either).[1] Finding "the most dog-like things" in such a program is trivial. Therefore, we should be able to solve ontology identification by heavily restricting the internal structure of programs (to structures which look similar to simple statistical patterns in sensory data).
So, to formalize my "conceptual solution" we need models which are visually/structurally/spatially/dynamically similar to the sensory data they model. I asked Gemini about it, multiple times, with Deep Research. The only interesting reference Gemini found is Agent-based models (AFAIU, "agents" just means "any objects governed by rules").
This is not obvious; it requires analyzing basic properties of human values.
Then you can use the three dot points in my comment to construct source code for a new agent that does the same thing, but is not nicely separated.
This is the step I don't get (how we make the construction), because I don't understand SGD well. What does "sample N world models" mean?
My attempt to understand: We have a space of world models (W) and a space of plans (P). We pick points from P (using SGD) and evaluate them on the best points of W (we got those best points by trying to predict the world and applying SGD).
My thoughts/questions: To find the best points of W, we still need to do modelling independently from planning? While the world model is not stored in memory, some pointer to the best points of W is stored? We at least have "the best current plan" stored independently from the world models?
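To make my reading concrete, here's a hypothetical toy version of what I think is being proposed (the coin-flip setup, function names, and scoring are all mine, purely to check whether I'm parsing the construction correctly):

```python
import random

def prediction_loss(bias_guess: float, flips: list[int]) -> float:
    # Squared error between a guessed coin bias and the observed frequency of heads.
    return (bias_guess - sum(flips) / len(flips)) ** 2

def best_world_models(n: int, flips: list[int], keep: int = 5) -> list[float]:
    # "Sample N world models": random sampling stands in for SGD over W,
    # and we keep the models that best predict the data ("best points of W").
    candidates = [random.random() for _ in range(n)]
    return sorted(candidates, key=lambda b: prediction_loss(b, flips))[:keep]

def pick_plan(plans: list[float], models: list[float]) -> float:
    # Evaluate points of P (here, bet sizes) on the best points of W,
    # scoring each plan under the worst of the retained world models.
    return max(plans, key=lambda bet: min(bet * (2 * b - 1) for b in models))

flips = [1, 1, 0, 1, 1, 0, 1, 1]
models = best_world_models(n=200, flips=flips)
print(pick_plan(plans=[0.0, 0.5, 1.0], models=models))
```

If this is roughly right, then the retained world models (the "best points of W") are still computed and kept around separately from the plans, which is what my questions above are pointing at.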
A question about natural latents.
Does the latter type (3) of natural latents have any special properties? Is it some sort of "meta-level" natural latent (compared to type 2)? I'm asking because I think this type of latent might be relevant to how human abstractions work. Here's where I wrote about it in more detail.