Discord: LemonUniverse (.lemonuniverse). Reddit: u/Smack-works. About my situation: here.
I wrote some bad posts before 2024 because I was very uncertain about how events might develop.
I do philosophical/conceptual research and have no mathematical or programming skills, but I do know a bunch of mathematical and computer science concepts.
Yes, it could be that "special, inherently more alignable cognition" doesn't exist or can't be discovered by mere mortal humans. It could be that humanlike reasoning isn't inherently more alignable. Finally, it could be that we can't afford to study it because the dominant paradigm is different. Also, I realize that glass box AI is a pipe dream.
Wrt sociopaths/psychopaths: I'm approaching it from a more theoretical standpoint. If I knew a method of building a psychopathic AI (caring about something selfish, e.g. gaining money or fame or social power or new knowledge or even paperclips) and knew the core reasons why it works, I would consider it major progress, because it would solve many alignment subproblems, such as ontology identification and subsystems alignment.
I'm approaching it from a "theoretical" perspective[1], so I want to know how "humanlike reasoning" could be defined (beyond "here's some trusted model which somehow imitates human judgement") or why human-approved capability gain preserves alignment (like, what's the core reason, what makes human judgement good?). So my biggest worry is not that the assumption might be false, but that the assumption is barely understood on the theoretical level.
What are your research interests? Are you interested in defining what "explanation" means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I'm kinda following the pipe dream of glass box AI.
I'm contrasting theoretical and empirical approaches.
Empirical - "this is likely to work, based on evidence". Theoretical - "this should work, based on math / logic / philosophy".
Empirical - "if we can operationalize X for making experiments, we don't need to look for a deeper definition". Theoretical - "we need to look for a deeper definition anyway".
Alignment plans can be split into two types:
Usual plans. AI gains capabilities (X). We figure out how to point X at the alignment target (Y). There's no deep connection between X and Y. One thing is mounted onto the other.
HRLM plans. We give AI a special X, with a deep connection to Y.
HRLM is the idea that there's some special reasoning/learning method which is crucial for alignment or makes it fundamentally easier. HRLM means "humanlike reasoning/learning method" or "special, human-endorsed reasoning/learning method". There's no hard line separating the two types of plans. It's a matter of degree.
I believe HRLM is ~never discussed in full generality and ~never discussed from a theoretical POV. This is a small post where I want to highlight the idea and facilitate discussion, not make a strong case for it.
(My description of other people's work is not endorsed by them.)
Corrigibility. "Corrigible cognition" is a hypothetical, special type of self-reflection () which is extremely well-suited for learning human values/desires ().
In "Don't align agents to evaluations of plans" Alex Turner argues "there's a correct way to reason () about goals () and consequentialist maximization of an 'ideal' function is not it", "'direct cognition' () about goals () is fundamentally better than 'indirect cognition'". Shard Theory, in general, proposes a very special method for learning and thinking about values.
A post about "follow-the-trying game" by Steve Byrnes basically says "AI will become aligned or misaligned at the stage of generating thoughts, so we need to figure out the 'correct' way of generating thoughts (), instead of overfocusing on judging what thoughts are aligned ()". Steve's entire agenda is about HRLM.
Large Language Models. I'm not familiar with the debate, but I would guess it boils down to two possibilities: "understanding human language is a core enough capability () for a LLM, which makes it inherently more alignable to human goals ()" and "LLMs 'understand' human language through some alien tricks which don't make them inherently more alignable". If the former is true, LLMs are an example of HRLM.
Policy Alignment (Abram Demski) is tangentially related, but it's more in the camp of "usual plans".
Notice how, despite multiple agendas falling under HRLM (Shard Theory, brain-like AGI, LLM-focused proposals), there's almost no discussion of HRLM from a theoretical POV. What is, abstractly speaking, "humanlike reasoning"? What are the general principles of it? What are the general arguments for safety guarantees it's supposed to bring about? What are the True Names here? With Shard Theory, there's ~zero explanation of how simpler shards aggregate into more complex shards and how it preserves goals. With brain-like AGI, there's ~zero idea of how to prevent thought generation from bullshitting thought assessment. But those are the very core questions of the agendas. So they barely move us from square one.[1]
There are many possibilities. It could be that any HRLM handicaps the AI's capabilities (a superintelligence is supposed to be unimaginably better at reasoning than humans, so why wouldn't it have an alien reasoning method?). It could also be that HRLM is necessary for general intelligence. But maybe general intelligence is overrated...
Here's what I personally believe right now:
I consider 1-3 to be plausible enough postulates. I have no further arguments for 4.
I have a couple of very unfinished ideas. I'll try to write about them this month or the next.
I believe there could be a special type of cognition which helps to avoid specification gaming and goal misgeneralization. The AI should create simple models which describe the "costs/benefits" of actions (e.g. "actions" can be body movements, "cost" can be the number and complexity of movements, "benefit" can be distance covered); this way the AI can notice when certain actions produce anomalously high benefit (e.g. maybe certain body movements exploit a glitch in the physics simulation, making the body cover kilometers per second).
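A minimal sketch of how such a check could look, under my own toy assumptions (the "simple cost/benefit model" is just a benefit-to-cost ratio, and "anomalous" means far above the typical ratio; the factor of 10 is arbitrary):

```python
from statistics import median

def flag_anomalous_actions(actions, costs, benefits, factor=10.0):
    """Flag actions whose benefit is anomalously high for their cost.

    Toy stand-in for a "simple cost/benefit model": the model is just the
    benefit-to-cost ratio of each action, and an action is suspicious
    (a possible glitch exploit) if its ratio is far above the typical ratio.
    """
    ratios = [b / c for b, c in zip(benefits, costs)]
    typical = median(ratios)
    return [a for a, r in zip(actions, ratios) if r > factor * typical]

# Example: the last "movement" covers absurd distance for a tiny cost.
print(flag_anomalous_actions(
    actions=["walk", "run", "jump", "glitch-clip"],
    costs=[1.0, 2.0, 1.5, 0.5],
    benefits=[1.0, 2.5, 1.2, 5000.0],
))  # -> ['glitch-clip']
```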
"By default, manipulating easier to optimize/comprehend variables is better than manipulating harder to optimize/comprehend variables" — this is the idea from one of my posts. The problem with it is that I only defined "optimization" and "comprehension" for world-models, not for any modelling (= cognition) in general.
A formal algorithm can have parts on which it critically depends (for example, an algorithm for solving equations might have an absolutely necessary addition sub-algorithm). An informal algorithm can have parts without critically depending on those parts (for example, the algorithm answering "is this a picture of a dog?" might have a sub-algorithm answering "is this patch of pixels the focal point of the image / does it contrast enough with other patches / is it as detailed as the other patches?" - the sub-algorithm is not strictly necessary, but it reduces pareidolia by preventing the algorithm from overanalyzing random parts of the image). I think we can say something about how the latter type of algorithm works.
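To make the contrast concrete, here's a toy sketch of an "informal algorithm with a non-critical part"; the class, attribute names, and thresholds are all hypothetical, purely to illustrate the structure:

```python
from dataclasses import dataclass

@dataclass
class Patch:
    contrast: float  # how strongly the patch contrasts with the rest of the image
    detail: float    # how detailed the patch is relative to the rest of the image
    dogness: float   # stand-in score from some learned "dog pattern" matcher

def is_focal_point(patch: Patch) -> bool:
    # Non-critical sub-algorithm: the classifier still works if you delete it,
    # it just overanalyzes random background patches more often ("pareidolia").
    return patch.contrast > 0.5 and patch.detail > 0.5

def looks_like_a_dog(patches: list[Patch]) -> bool:
    # Critical part: accumulate "dogness" evidence over the patches we attend to.
    score = sum(p.dogness for p in patches if is_focal_point(p))
    return score > 1.0

# Compare: in a formal algorithm (e.g. equation solving), deleting the addition
# sub-algorithm breaks everything; here deleting is_focal_point only degrades quality.
print(looks_like_a_dog([Patch(0.9, 0.8, 1.5), Patch(0.1, 0.2, 0.9)]))  # True
```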
IMO that's downstream of inner alignment being extremely hard. It's almost impossible to come up with even a mildly promising solution which explains, in at least some detail, how the hardest part of the problem might get solved. I'm not trying to throw shade. Also, I might just be ignorant about some ideas in those agendas.
(Here's an observation about adjectives, verbs, and language in general. It might be important even if I'm misinterpreting the definition of natural latents.)
For many adjectives, we can define the concept "salience of <insert an adjective>". Salience of color / texture / shape / size / etc.
For example, what's "salience of a texture"? It's a function of how much of the texture is present (in your field of view) and how strongly the texture contrasts with other present textures.
We can learn an empirical rule: "if a texture is salient enough, then it's probably caused by a single object or a single kind of objects".[1] Yet which object or kind it is can be anything. Would this make "being caused by an object/kind X" a natural latent over "pixels with a salient texture Y" and vice versa? An object's texture tends to be similarly salient in many different situations, so a particular value of "salience of a texture" can itself be a natural latent. "Salience of a texture" is not the same thing as "a texture", but it's one of the reasons why textures are important.
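A toy operationalization of "salience of a texture", under my own guessed formula (coverage of the field of view times contrast relative to the other textures present), with the empirical rule as a threshold check:

```python
def texture_salience(coverage: float, contrast: float, other_contrasts: list[float]) -> float:
    """Guessed formula: how much of the field of view the texture covers,
    times how strongly it contrasts with the other textures present."""
    background = max(other_contrasts, default=0.0)
    return coverage * (contrast / (background + 1e-9))

def probably_single_source(coverage, contrast, other_contrasts, threshold=0.5) -> bool:
    # Empirical rule from the text: a salient-enough texture is probably
    # caused by a single object or a single kind of objects.
    return texture_salience(coverage, contrast, other_contrasts) > threshold

# A large, high-contrast "angular" region (mountains) vs. a faint speck of noise.
print(probably_single_source(0.4, 0.9, [0.3, 0.2]))   # True
print(probably_single_source(0.01, 0.3, [0.9, 0.8]))  # False
```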
(Landscape example below.)
Similarly, we can consider "salience of an action", with an empirical rule like "salient actions (e.g. salient movements) are usually caused by a single object / a single kind of objects / a single causal process".[2] Such a rule makes fine-grained classification of actions important, which makes action-related words important.
Beyond "salience of a texture", we can consider concepts like "danger of an object" (is it a good idea to touch or step on?) and "traversability of an object" (can you walk/climb on it?). For many types of objects, those concepts will be independent of most facts about the object other than its texture. Let's call concepts like this (salience, danger, traversability, etc.) "auxiliary".
Maybe this sort of reasoning can explain analogies and connotation, since auxiliary concepts are a type of connotation.
Imagine looking at a nature landscape. You notice a bunch of angular texture (mountains), a bunch of puffy texture (clouds), a bunch of fluffy texture (trees), a bunch of smooth texture (fields). The rule says each type of texture most likely belongs to a single object or kind. Note that this is not trivial - we could live in a world where we often see radically different things with similar, equally salient textures at the same time.
For example, imagine you see a burst of flame and rubble flying around - most likely it's a single causal process, an explosion.
I have a couple of questions/points. They might stem from not understanding the math.
1) The very first example shows that absolutely arbitrary things (e.g. arbitrary green lines) can be "natural latents". Does this mean that "natural latents" don't capture the intuitive idea of "natural abstractions"? That is, all natural abstractions are natural latents, but not all natural latents are natural abstractions? You seem to be confirming this interpretation, but I just want to be sure:
So we view natural latents as a foundational tool. The plan is to construct more expressive structures out of them, rich enough to capture the type signatures of the kinds of concepts humans use day-to-day, and then use the guarantees provided by natural latents to make similar isomorphism claims about those more-complicated structures. That would give a potential foundation for crossing the gap between the internal ontologies of two quite different minds.
Is there any writing about what those "more expressive structures" could be?
2) Natural latents can describe both things which propagate through very universal, local physical laws (e.g. heat) and any commonalities in made-up categories (e.g. "cakes"). Natural latents seem very interesting in the former case, but I'm not sure about the latter, and I'm not sure the analogy between the two gives any insight. I'm still not seeing any substantial similarity between cakes and heat or Ising models. I.e. I see that an analogy can be made, but I don't feel that this analogy is "grounded" in important properties of reality (locality, universality, low-levelness, stability, etc.). Does this make sense?
3) I don't understand what "those properties can in-principle be well estimated by intensive study of just one or a few mai tais" (from here) means. To me a natural latent is something like ~"all words present in all of 100 books"; that's impossible to know unless you read every single book.
If I haven't missed anything major, I'd say core insights about abstractions are still missing.
EDIT 17/07: I did miss at least one major thing. I haven't understood the independence condition. If you take all words present in all 100 books, it doesn't guarantee that those words make the books or their properties independent.
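For concreteness, here's the toy latent I have in mind, "all words present in all of the books". It's easy to compute, but, per the edit above, computing it says nothing about whether it satisfies the independence condition:

```python
def shared_words(books: list[str]) -> set[str]:
    """The candidate latent: the words that appear in every book."""
    word_sets = [set(book.lower().split()) for book in books]
    return set.intersection(*word_sets) if word_sets else set()

books = ["the dog ran home", "the cat ran away", "the bird ran far and wide"]
print(shared_words(books))  # {'the', 'ran'}
# Note: conditioning on this shared vocabulary does not automatically make
# the books (or their other properties) approximately independent.
```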
Have an idea about interpretability and defining search/optimization.
Finite algorithms can solve infinite (classes of) problems. For example, the algorithm for adding two numbers has a finite description, yet can solve an infinity of examples.
This is a basic truth of computability theory.
Intuitively, it means that algorithms can exploit regularities in problems. But "regularity" here can only be defined tautologically (any smaller thing which solves/defines a bigger thing).
Many algorithms have the following property:
Intuitively, it means that algorithms can exploit regularities in problems. However, here "regularity" has a stricter definition than in the trivial property (TP). A "regularity" is an easily computable thing which gives you important, easily computable information about a hard-to-compute thing.
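A toy contrast between the two properties, with my own examples (not ones from any post): the addition function illustrates TP, and the divisor check illustrates LTP, where a cheap regularity (the last bit of n) often settles a question that would otherwise require a search.

```python
# Trivial property (TP): a finite description handles infinitely many inputs.
def add(a: int, b: int) -> int:
    return a + b

# Less trivial property (LTP): exploit an easily computable "regularity" that
# carries important information about a hard-to-compute answer.
def has_nontrivial_divisor(n: int) -> bool:
    # Cheap regularity: for n > 2, the last bit alone tells us n is divisible by 2.
    if n > 2 and n % 2 == 0:
        return True
    # Hard-to-compute fallback: trial division up to sqrt(n).
    d = 3
    while d * d <= n:
        if n % d == 0:
            return True
        d += 2
    return False

print(add(2, 3), has_nontrivial_divisor(10**12 + 2), has_nontrivial_divisor(13))
# -> 5 True False
```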
Now, the question is: what classes of algorithms have the less trivial property (LTP)?
LTP includes a bunch of undefined terms:
Probably giving fully general definitions right away is not important. We could try starting with overly restrictive definitions and see if we can prove anything about those.
If neural networks implement algorithms with LTP, we could try finding those algorithms by looking for the regularities they exploit (which is much easier than looking for the algorithms themselves).
Furthermore, LTP seems very relevant to defining search / optimization.
Examples of algorithms with LTP:
Got around to interrogating Gemini for a bit.
Seems like KSF talks about programs generating sets. It doesn't say anything about the internal structure of the programs (but that's where objects such as "real diamonds" live). So let's say the data is a very long video of dogs doing various things. If I apply KSF, I get programs (aka "codes") generating sets of videos. But it doesn't help me identify "the most dog-like thing" inside each program. For example, one of the programs might be an atomic model of physics, where "the most dog-like things" are stable clouds of atoms. But KSF doesn't help me find those clouds. A similarity metric between videos doesn't help either.
My conceptual solution to the above problem, proposed in the post: if you have a simple program with a special internal structure describing simple statistical properties of "dog-shaped pixels" (such a program is guaranteed to exist), there also exists a program with a very similar internal structure describing "valuable physical objects causing dog-shaped pixels" (if such a program doesn't exist, then "valuable physical objects causing dog-shaped pixels" don't exist either).[1] Finding "the most dog-like things" in such a program is trivial. Therefore, we should be able to solve ontology identification by heavily restricting the internal structure of programs (to structures which look similar to simple statistical patterns in sensory data).
So, to formalize my "conceptual solution" we need models which are visually/structurally/spatially/dynamically similar to the sensory data they model. I asked Gemini about it, multiple times, with Deep Research. The only interesting reference Gemini found is Agent-based models (AFAIU, "agents" just means "any objects governed by rules").
This is not obvious; it requires analyzing basic properties of human values.
Then you can use the three dot points in my comment to construct source code for a new agent that does the same thing, but is not nicely separated.
This is the step I don't get (how we make the construction), because I don't understand SGD well. What does "sample N world models" mean?
My attempt to understand: We have a space of world models (W) and a space of plans (P). We pick points from P (using SGD) and evaluate them on the best points of W (we got those best points by trying to predict the world and applying SGD).
My thoughts/questions: To find the best points of W, we still need to do modelling independently from planning? While the world model is not stored in memory, some pointer to the best points of W is stored? We at least have "the best current plan" stored independently from the world models?
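To make my reading concrete, here's a hypothetical toy version of what I think is being proposed (the coin-flip setup, function names, and scoring are all mine, purely to check whether I'm parsing the construction correctly):

```python
import random

def prediction_loss(bias_guess: float, flips: list[int]) -> float:
    # Squared error between a guessed coin bias and the observed frequency of heads.
    return (bias_guess - sum(flips) / len(flips)) ** 2

def best_world_models(n: int, flips: list[int], keep: int = 5) -> list[float]:
    # "Sample N world models": random sampling stands in for SGD over W,
    # and we keep the models that best predict the data ("best points of W").
    candidates = [random.random() for _ in range(n)]
    return sorted(candidates, key=lambda b: prediction_loss(b, flips))[:keep]

def pick_plan(plans: list[float], models: list[float]) -> float:
    # Evaluate points of P (here, bet sizes) on the best points of W,
    # scoring each plan under the worst of the retained world models.
    return max(plans, key=lambda bet: min(bet * (2 * b - 1) for b in models))

flips = [1, 1, 0, 1, 1, 0, 1, 1]
models = best_world_models(n=200, flips=flips)
print(pick_plan(plans=[0.0, 0.5, 1.0], models=models))
```

If this is roughly right, then the retained world models (the "best points of W") are still computed and kept around separately from the plans, which is what my questions above are pointing at.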
A question about natural latents.
Does the latter type (3) of natural latents have any special properties? Is it some sort of "meta-level" natural latent (compared to type 2)? I'm asking because I think this type of latent might be relevant to how human abstractions work. Here's where I wrote about it in more detail.