The "Minimal Latents" Approach to Natural Abstractions

[-]Vivek Hebbar3yΩ685

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

[-]Rohin Shah3yΩ687

You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most $2^{64,000,000,000,000}$ different functions, which is a tiny tiny fraction of the full space of $2^{2^{8,000,000}}$ possible functions. You're already getting at least $2^{8,000,000} - 64,000,000,000,000$ of the bits just by choosing the network architecture.

(This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
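The counting above can be checked directly. This is only a sketch of the arithmetic, and it assumes one particular reading of the numbers: inputs are 8,000,000-bit strings and the target is a one-bit (Boolean) function, so a truth table needs $2^{8,000,000}$ one-bit entries. The variable names are illustrative.

```python
# Assumption: inputs are 8,000,000-bit strings and the function outputs one bit.
INPUT_BITS = 8_000_000
PARAMS = 10**12          # "1 trillion parameter network"
BITS_PER_PARAM = 64

# Picking one Boolean function on n-bit inputs means filling in a truth
# table with 2**n one-bit entries, i.e. 2**n bits of information.
bits_to_specify_any_function = 1 << INPUT_BITS

# The network's weights can hold at most this many bits, so it can
# represent at most 2**bits_in_network distinct functions.
bits_in_network = PARAMS * BITS_PER_PARAM

# Bits already "spent" just by restricting to functions the architecture can express.
bits_from_architecture = bits_to_specify_any_function - bits_in_network

print(f"bits in network: {bits_in_network:,}")
print(f"bits from architecture has {bits_from_architecture.bit_length():,} binary digits")
```

The second number is far too large to print in full, which is the point: the 64 trillion bits the weights can hold barely dent it.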

[-]LawrenceC3yΩ342

See also Superexponential Concept Space, and Simple Words, from the Sequences:

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.
[...]
From this perspective, learning doesn't just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.
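A scaled-down version of the quoted argument can be enumerated exhaustively. The sketch below uses three binary attributes instead of forty (an assumption made purely so the concept space fits in memory) and a hypothetical target concept; it counts how many concepts remain consistent as labelled examples arrive, with no inductive bias at all.

```python
from itertools import product

N_ATTRIBUTES = 3  # the quote uses 40; 3 keeps the enumeration tiny
examples = list(product([0, 1], repeat=N_ATTRIBUTES))          # 2**3 = 8 examples
concepts = list(product([False, True], repeat=len(examples)))  # 2**8 = 256 concepts

# Hypothetical target concept: "the first attribute is 1".
target = tuple(x[0] == 1 for x in examples)

def consistent(concept, labelled):
    """True if the concept agrees with every (example_index, label) pair seen so far."""
    return all(concept[i] == label for i, label in labelled)

remaining = []
for k in range(len(examples) + 1):
    labelled = [(i, target[i]) for i in range(k)]  # first k examples, correctly labelled
    remaining.append(sum(consistent(c, labelled) for c in concepts))

print(remaining)          # each labelled example only halves the concept space
print(2**40 > 10**12)     # forty attributes: more than a trillion possible examples
```

Without a prior, the candidate set only shrinks to a single concept once every possible example has been labelled, which matches the "you'd have to see every possible example" claim.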

[-]davidad3yΩ472

As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between $Λ^{*}$ and $Λ$ seem inconsistent?).

I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the $X_{i}$) in the coslice category $Ω ↓ \mathrm{Stoch}$, where $\mathrm{Stoch}$ is the category of Markov kernels and $Ω$ is the implicit sample space with respect to which all the random variables are defined.

[-]tailcalled3y40

Are you sure? Wouldn't the categorical product need to make the $X_{i}$ independent not just from each other but also from $Ω$?

[-]Lucius Bushnaq3y52

Epistemic status: sleep deprived musings

If I understand this right, this is starting to sound very testable.

Feed a neural network inputs consisting of variables $X_{1}, \dots, X_{n}$: configurations in a 2D Ising model, cat pictures, or anything else we humans think we know the latent variables for.

Train neural networks to output a set of variables $Λ$ over the inputs. The loss function scores based on how much the output induces conditional independence of inputs over the training data set.

E.g., take the $D_{KL}$ divergence between $P(X_{1}, \dots, X_{n} | Λ)$ and $P(X_{1} | Λ) P(X_{2} | Λ) \dots P(X_{n} | Λ)$. Then, penalise $Λ$ with a higher information content through a regularisation term, e.g. the $D_{KL}$ divergence between $P(X_{1}, \dots, X_{n} | Λ)$ and $P(X_{1}, \dots, X_{n})$.^[1]
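For a concrete feel of the two terms, here is a minimal sketch on an exact joint table rather than on network outputs. It assumes two discrete inputs and a discrete latent, and it averages each KL divergence over $Λ$, which turns the first term into the conditional mutual information $I(X_{1}; X_{2} | Λ)$ and the second into $I(X_{1}, X_{2}; Λ)$. The function names and the toy distributions are illustrative, not part of the proposal.

```python
from collections import defaultdict
from math import log2

def cond_independence_loss(joint):
    """Expected KL( P(x1,x2|lam) || P(x1|lam) P(x2|lam) ) in bits, i.e. I(X1; X2 | Lam).

    `joint` maps (x1, x2, lam) -> probability.
    """
    p_lam, p_x1_lam, p_x2_lam = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, x2, lam), p in joint.items():
        p_lam[lam] += p
        p_x1_lam[(x1, lam)] += p
        p_x2_lam[(x2, lam)] += p
    total = 0.0
    for (x1, x2, lam), p in joint.items():
        if p > 0:
            ratio = p * p_lam[lam] / (p_x1_lam[(x1, lam)] * p_x2_lam[(x2, lam)])
            total += p * log2(ratio)
    return total

def information_penalty(joint):
    """Expected KL( P(x1,x2|lam) || P(x1,x2) ) in bits, i.e. I(X1, X2; Lam)."""
    p_lam, p_x = defaultdict(float), defaultdict(float)
    for (x1, x2, lam), p in joint.items():
        p_lam[lam] += p
        p_x[(x1, x2)] += p
    return sum(p * log2(p / (p_lam[lam] * p_x[(x1, x2)]))
               for (x1, x2, lam), p in joint.items() if p > 0)

# Toy case 1: X1 = X2 = Lam, a fair coin. The latent explains all the correlation.
latent_explains = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Toy case 2: X1 = X2 is a fair coin but Lam is constant, so it explains nothing.
latent_constant = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}

print(cond_independence_loss(latent_explains), information_penalty(latent_explains))
print(cond_independence_loss(latent_constant), information_penalty(latent_constant))
```

The informative latent pays one bit of regularisation to drive the independence term to zero, while the constant latent pays nothing and leaves one bit of unexplained correlation. A trained network would have to estimate these quantities from samples rather than read them off a table.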

Then, you can check if the solutions found match the ones other networks, or humans and human science, would give for that system. Either by comparing $P(X_{1}, \dots, X_{n} | Λ)$, or by looking at $Λ$ directly.

You can also train a second network to reconstruct $[x_{1}, \dots, x_{n}]$ from the latents and see what comes out.

You might also be able to take a network stitched together from the latent generation and a latent read-out network, and see how well it does on various tasks over the dataset. Image labelling, calculating the topological charge of field configurations, etc. Then compare that to a generic network trained to solve these tasks.

If the hypothesis holds strongly, generic networks of sufficient generality go through the same process when they propose solutions. Just with less transparency. So you’d expect the stitched and conventional networks to score similarly.

My personal prediction would be that you'd usually need to require solving a few different tasks on the data for that to occur. Otherwise the network doesn't need to understand all the abstractions in the system to get the answer, and can get away with learning fewer latents.

I think we kind of informally do a lot of this already when we train an image classifier and then Anthropic opens up the network to see e.g. the dog-head-detection function/variable in it. But this seems like a much cleaner, more secure and well defined cooking formula for finding latents to me, which may be rote implementable for any system or problem.

Unless this is already a thing in Interpretability code bases and I don’t know about it?

[1] I haven't checked the runtime on this yet; you'd need to cycle through the whole dataset once per loss function call to get the distributions. But it's probably at least doable for smaller problems, and for bigger datasets a stochastic sampling ought to be enough.

[-]Shmi3y51

nearly-all the informational work done in a toddler’s mind of figuring out which pattern is referred to by the word “apple” must be performed by priors and general observations of the world, not by examples of apples specifically.

Mildly related: Most image/sound/etc. lossy compression algorithms (and that is what an abstraction is, a form of lossy data compression) are based on the Discrete Cosine Transform. Do you think that the brain does something like the DCT when relating visible apples to the concept of apple?

[-]Jon Garcia3y32

The cortex uses traveling waves of activity that help it organize concepts in space and time. In other words, the locally traveling waves provide an inductive bias for treating features that occur close together in space and time as part of the same object or concept. As a result, cortical space ends up mapping out conceptual space, in addition to retinotopic, somatic, or auditory space.

This is kind of like DCT in the sense that oscillations are used as a scaffold for storing or reconstructing information. I think that Neural Radiance Fields (NeRF) use a similar concept, using positional encoding (3D coordinates plus viewing angle, rather than 2D pixel position) to generate images, especially when the positional encoding uses Fourier features. Of course, Transformers also use such sinusoidal positional encodings to help with natural language understanding.
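For readers who have not seen them, the sinusoidal positional encodings mentioned above can be written in a few lines. This is a sketch of the standard formulation from the original Transformer paper (sine and cosine pairs at geometrically spaced frequencies); the function name and the `base` default are illustrative choices, and nothing here is specific to the cortex or to NeRF.

```python
import math

def sinusoidal_encoding(position, dim, base=10000.0):
    """Length-`dim` sinusoidal encoding of an integer position (`dim` must be even)."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / base ** (i / dim)   # lower frequency as i grows
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(sinusoidal_encoding(0, 4))

# Because sin(a)sin(b) + cos(a)cos(b) = cos(a - b), the similarity between two
# encodings depends only on the offset between positions, not on where they are.
near_start = dot(sinusoidal_encoding(3, 8), sinusoidal_encoding(7, 8))
further_on = dot(sinusoidal_encoding(10, 8), sinusoidal_encoding(14, 8))
print(abs(near_start - further_on) < 1e-9)
```

That offset-only similarity is one way oscillations act as a scaffold: nearby positions get similar codes regardless of absolute location.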

All that is to say that I agree with you. Something similar to DCT will probably be very useful for discovering natural abstractions. For one thing, I imagine that these sorts of approaches could help overcome texture bias in DNNs by incorporating more large-scale shape information.

[-]Shmi3y30

Thanks! Your links led me down some interesting avenues.

[-]Thane Ruthenis3y*Ω350

This touches on some issues I'd wanted to discuss: abstraction hierarchies, and incompatible abstraction layers.

So, here’s a new conditional independence condition for “large” systems, i.e. systems with an infinite number of $X_{i}$’s: given $Λ$, any finite subset of the $X_{i}$’s must be approximately independent (i.e. mutual information below some small $ϵ$) of all but a finite number of the other $X_{i}$’s.

Suppose we have a number of tree-instances $X_{1}, X_{2}, \dots, X_{n}$. Given a sufficiently large $ϵ$, we can compute a valid "general tree abstraction". But what if we've picked a lower $ϵ$, and are really committed to keeping it low, for some reason?

Here's a trick:

We separate tree-instances into sets $S_{1}, S_{2}, \dots, S_{m}$ such that we can compute the corresponding "first-order" abstractions $Λ_{1}, Λ_{2}, \dots, Λ_{m}$ over each set, and they would be valid, in the sense that any two $X_{i}, X_{j} \in S_{k}$ would have mutual information below $ϵ$ when conditioned on $Λ_{k}$^[1]. Plausibly, that would recover a set of abstractions corresponding to "tree species".

Then we repeat the trick: split the first-order abstractions $Λ_{1}, Λ_{2}, \dots, Λ_{m}$ into sets, and generate second-order abstractions $Λ_{1}^{II}, Λ_{2}^{II}, \dots, Λ_{q}^{II}$. That may recover, say, genuses.

We do this iteratively until getting a single nth-order abstraction $Λ^{Ω}$ , standing-in for "all trees".

I think it would all have sensible behavior. Conditioning any given tree-instance $X_{i}$ on $Λ^{Ω}$ would only explain general facts about the trees, as we wanted. Conditioning on the appropriate lower-level abstractions would explain progressively more information about $X_{i}$. Conditioning an $X_{i} \notin S_{j}$ on $Λ_{j}^{I}$, in turn, would turn up some information that's in excess, or make some wrong predictions, but get the general facts right. (And you can also condition first-order abstractions on higher-order abstractions, etc.)
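The shape of this iterated construction can be shown with a deliberately crude stand-in. In the sketch below an "abstraction" over a set is just the attributes shared by every member, which is not the mutual-information definition and involves no $ϵ$; the tree instances, attribute names, and groupings are all hypothetical. It only illustrates how summaries get more general at each level, and how grouping by a different rule still ends at the same top-level summary.

```python
# Hypothetical tree instances, each described by a set of observed attributes.
trees = {
    "oak_1":   {"has_trunk", "has_roots", "broadleaf", "acorns", "lobed_leaves"},
    "oak_2":   {"has_trunk", "has_roots", "broadleaf", "acorns", "lobed_leaves", "lightning_scar"},
    "beech_1": {"has_trunk", "has_roots", "broadleaf", "smooth_bark"},
    "beech_2": {"has_trunk", "has_roots", "broadleaf", "smooth_bark", "bird_nest"},
    "pine_1":  {"has_trunk", "has_roots", "needles", "cones"},
    "pine_2":  {"has_trunk", "has_roots", "needles", "cones", "broken_branch"},
}

def summarize(attribute_sets):
    """Crude stand-in for a latent: the attributes common to every member."""
    return set.intersection(*attribute_sets)

def abstract(groups, lower_level):
    """Summarize each named group of lower-level items."""
    return {name: summarize([lower_level[m] for m in members])
            for name, members in groups.items()}

# First order: group instances by (hypothetical) species.
species = {"oak": ["oak_1", "oak_2"], "beech": ["beech_1", "beech_2"], "pine": ["pine_1", "pine_2"]}
first_order = abstract(species, trees)

# Second order: group the species-level summaries (groupings are illustrative).
higher_groups = {"broadleaved": ["oak", "beech"], "conifer": ["pine"]}
second_order = abstract(higher_groups, first_order)

# Top level: a single summary standing in for "all trees".
top = summarize(second_order.values())

# A different grouping rule over the same instances: whose backyard they are in.
backyards = {"A": ["oak_1", "oak_2", "pine_1", "pine_2"], "B": ["beech_1", "beech_2"]}
backyard_level = abstract(backyards, trees)
top_from_backyards = summarize(backyard_level.values())

print(sorted(top))
print(sorted(backyard_level["A"]), sorted(backyard_level["B"]))
```

In this toy, the mixed-species backyard A gets a much less specific summary than the single-species backyard B, and both hierarchies terminate in the same top-level set of general facts.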

The question is: how do we pick $ϵ$? One potential answer is that, given some set of instances $X_{1}, X_{2}, \dots, X_{n}$, we always try for the lowest $ϵ$ possible^[1]. Perhaps that's the mathematical description of taxonomy, even? "Given a set of instances, generate the abstraction hierarchy that minimizes $ϵ$ at each abstraction-level."

There's a different way to go about it, though. Suppose that, instead of picking $ϵ$ and then deciding on groupings, we first split instances $X_{1}, X_{2}, \dots, X_{n}$ into sets, according to some rule? We have to be able to do that: we've somehow decided to abstract over these specific $X_{1}, X_{2}, \dots, X_{n}$ to begin with, so we already have some way to generate groupings. (We've somehow arrived at a set of tree-instances to abstract over, instead of a mixture of cars, trees, towels, random objects...)

So, we pick some "rule", which is likely a natural abstraction in itself, or defined over one. Like "trees that are N years old" with separate set for every N, or "this tree has leaves" y/n, or "trees in %person%'s backyard" for every %person%. Then we split the instances into sets according to that rule, and try to summarize every set.

Important: that way, we may get meaningfully different $ϵ$s for every set! For example, suppose we cluster trees by whose backyard they're in.

Person A has trees of several different species growing in their yard. For them, we compute $S_{A}$, the corresponding abstraction/summary $Φ_{A}$^[2], and some $ϵ_{A}$ that makes $Φ_{A}$ be a valid abstraction.
Person B only plants trees of a single species. Again, we compute $S_{B}$, $Φ_{B}$, $ϵ_{B}$.
Obviously, $ϵ_{A} > ϵ_{B}$.

What does this approach yield us?

It's a tool of analysis. We can try different rules on for size, and see if that reveals any interesting data. (Do most people grow only trees of a single species in their yard?)
It's potentially useful for general-purpose search via constraints. Consider two different first-order abstractions, "trees of species z" $Λ_{z}$ and trees-in-my-backyard $Φ_{my}$. Computing the second-order abstraction from them would be rather arbitrary, but it's something we may want to do during a specific planning process!
- (Though note that combining any two nth-order abstractions would result in a (n+1)th-order abstraction that has at least as much information as $Λ^{Ω}$. I.e., any given valid abstraction hierarchy over a given set of instances terminates in the same max-level abstraction. I'm not sure if that's useful.)
It allows abstraction layers, as outlined below.

Consider humans, geopolitical entities, and ideological movements. They don't have a clear hierarchy: while humans are what constitutes the latter two "layers", ideological movements are not split across geopolitical lines (same ideologies can be present in different countries), and geopolitical entities are not split along ideological lines (a given government can have multiple competing ideologies). By implication, once you're viewing the world in terms of ideologies, you can't recover governments from this data; nor vice versa.

Similarly: As we've established, we can split trees by species $Λ_{1}^{I}, \dots, Λ_{n}^{I}$ and by "whose backyard they're in" $Φ_{1}^{I}, \dots, Φ_{g}^{I}$. But: we would not be able to recover genuses $Λ_{1}^{II}, \dots, Λ_{d}^{II}$ from the backyard-data $Φ_{1}^{I}, \dots, Φ_{g}^{I}$! Once we've committed to the backyard-classification, we've closed-off species-classification!

I propose calling such incompatible abstraction hierarchies abstraction layers. Behind every abstraction layer, there's some rule by which we're splitting instances into sets, and such rules are/are-defined-over natural abstractions, in turn.

Does all that make sense, on your model?

[1] And, I guess, such that there's at least one set with more than one instance, to forbid the uninteresting trivial case where there's a one-member set for every initial instance. More generally, we'd want the number of sets to be "small" compared to the number of instances, in some sense of "small".
[2] Reason for the change in notation from $Λ$ will be apparent later.
[3] Or maybe it's still useful, for general-purpose search via constraints?

[-]DanielFilan3yΩ440

Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)

Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)

[-]johnswentworth3yΩ340

Yes! There's two ways that can be relevant. First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.

Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of most abstract object types. That fact could either be hard-coded in the toddler's prior, or learned insofar as it's already learning all these natural latents in an unsupervised way and can notice the sparsity. So, when a parent says "apple" while there's an apple in front of the toddler, sparsity dramatically narrows down the space of things they might be referring to.

[-]Thane Ruthenis3yΩ340

What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they're described as distributions over the local deterministic constraints. The current post also mentions that the "core" part of an abstraction is the distribution , and its ability to explain variance in individual instances of $X_{i}$ .

Applying the deterministic-constraint framework to trees, I assume it says something like "given certain ground-truth conditions (e.g., the environment of a savannah + the genetic code of a given tree), the growth of tree branches of that tree species is constrained like so, the rate of mutation is constrained like so, the spread of saplings like so, and therefore we should expect to see such-and-such distribution of trees over the landscape, and they'll have such-and-such forms".

Is that roughly correct? Have you arrived at any different framework for thinking about type signatures?

[-]johnswentworth3yΩ340

Roughly, yeah. I currently view the types of and $P [X | Λ]$ as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.

[-]DragonGod3y43

A hypothesis based on this post:

Consider the subset of "human values" that we'd be "happy" (were we fully informed) for powerful systems to optimise for.

[Weaker version: "the subset of human values that it is existentially safe for powerful systems to optimise for".]

Let's call this subset "ideal values".

I'd guess that the "most natural" abstraction of values isn't "ideal values" themselves but something like "the minimal latents of ideal values".

Examples of what I mean by a concept being a "more natural" abstraction:

The concept is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
- More privileged → more natural
Most efficient representations of our universe contain simple embeddings of the concept
- Simpler embeddings → more natural

[-]romeostevensit3yΩ340

Related background on the philosophical problem: gavagai

[-]tailcalled3y40

One thing I've also been thinking about is how concepts spread in large-scale social networks. If you've got a social network A - B - C - D, i.e. where person A knows person B and person B knows person C, but person A does not know person C, and so on, then it's possible that basically none of the concrete things that person A's ideas are about will be things that person D knows of. However, many abstract/general things might still apply; so memes that are about general information can spread much further.

I suspect we're underrating the extent to which this affects our concept-language.

[-]Ben Amitay3y10

I also find the question very interesting, but have a different intuition about what travels farther. I think that in general, concrete things are actually quite similar wherever there are humans, at least over the distances that were relevant for most of our history. If I am a Judean, I know what a cow looks like, and every other Hebrew speaker knows too, and almost every speaker of any language similar to Hebrew knows too, though maybe they have a slightly different variant of cow. On the other hand, if I'm a Hindu starting a new religion that is about how to get enlightenment, chances are that in the next greenstone there would be 4 competing schools with mutually exclusive understandings of the word "enlightenment". The reason is that we generally synchronize our language around shared experience of the concrete, and have far fewer degrees of freedom when conceptualising it.

[-]tailcalled3y20

"A cow" is abstract and general, Betty the cow who you have years of experience with is concrete.

[-]Ben Amitay3y10

Abstract and general are spectra. I agree that the maximally-specific is not good at spreading, but neither is the maximally-general.

[-]tailcalled3y20

General might be more relevant than abstract.

General means something that applies in lots of different places/situations. I'm not sure memes about religious enlightenment apply in lots of places/situations; they seem to be dependent on weird states of mind and for weird purposes.

The aspect of abstract that is most relevant is probably avoiding excessive detail. Detail is expensive to transmit, so ideas with very brief accurate descriptions are better at spreading than ideas that require a lot of context. But this is not the only aspect of abstractness, as abstractness also tends to be about something only being a thought and not having concrete physical existence.

[-]Leon Lang3y*30

Now to answer our big question from the previous section: I can find some $Λ^{*}$ satisfying the conditions exactly when all of the $X_{i}$’s are independent given the “perfectly redundant” information. In that case, I just set $Λ^{*}$ to be exactly the quantities conserved under the resampling process, i.e. the perfectly redundant information itself.

In the original post on redundant information, I didn't find a definition for the "quantities conserved under the resampling process". You name this F(X) in that post.

Just to be sure: is your claim that if F(X) exists that contains exactly the conserved quantities and nothing else, then you can define $Λ^{*}$ like this? Or is the claim even stronger and you think such $F$ can always be constructed?

Edit: Flagging that I now think this comment is confused. One can simply define $F(X) = P(X^{\infty} ∣ X)$ as the conditional, which is a composition of the random variable $X$ and the function $F : x \mapsto P(X^{\infty} ∣ X = x)$.

[-]bhishma3y30

. Before the toddler ever hears the word,

It goes back even further for certain visual stimuli:

We examined fetal head turns to visually presented upright and inverted face-like stimuli. Here we show that the fetus in the third trimester of pregnancy is more likely to engage with upright configural stimuli when contrasted to inverted visual stimuli, in a manner similar to results with newborn participants. The current study suggests that postnatal experience is not required for this preference.

https://www.cell.com/current-biology/fulltext/S0960-9822(17)30580-8#secsectitle0015

[-]Ben Amitay3y30

I have a similar but more geometric way of thinking about it. I think of the distribution of properties as a topography of many mountains and valleys. Then we get hierarchical clustering as mountains with multiple tops, and for each cluster we get the structure of a lower dimensional manifold by looking only at the directions for which the mountain is relatively wide and flat.

Of course, the underlying geometry and, as a result, the distribution density are themselves subjective and dependent on what we care about: pixel-by-pixel or atom-by-atom comparison would not yield similarity between trees even of the same species.
