Neural Dimensionality

by David Björling
7th Jul 2025

Background

In a raw, newly trained LLM you will have a haphazard structure, where the exact same block of text may map onto many different neural activation patterns in the first layer of your LLM. Different seeds may lead to radically different activation patterns. This is partly because relevant context may lie outside the text itself, and partly because the training data mixes linguistic patterns from many, many humans, who might each respond differently to the very same question.

However: for the human brain, I believe there are certain contextual labels that every block of text can be put in relation to. One might view these contextual labels as a kind of dimensionality. Some of these key contextual labels are absolutely essential to us. I will now name a single binary scale that I think is of utmost importance when humans attribute meaning to words and combinations of words:

  • External/internal origin 

    To what degree does a particular combination of words map onto a meaning that can be described as internal fantasy, vision or idea? To what degree does it map onto something that has been learned and generalized about the external world? 

Each combination of words that maps onto meaning may also be divided into subsets of words. I propose that each set, and subset, of words that forms a meaningful combination will be placed somewhere along the external/internal origin axis. 

A complex set of words will most of the time likely sit somewhere between the extremes, probably rather firmly located. 

A single, tiny subset of words will instead likely represent a range of possible positions. Such a subset can take many possible values along the axis, depending on which other subsets it is embedded within.

In essence: The more complex the set of words, the more definitely localised the position on the external/internal origin-axis will be. This may be a bit reminiscent of how the position of a macroscopic object is quantum mechanically well defined, while the position of a subatomic object is smeared out, until it is actually measured.

The point being: For a human, keeping track of the external/internal dimension is a question of life and death. The inability to form internal thought makes any kind of communication impossible. The inability to distinguish between what is internal and what is external is a core problem in schizophrenia. Thinking “I can fly!”, and interpreting this thought as being of “external origin”, could be a recipe for death.

For living things this is of course a simplification, since most organisms do not appear to think in words. Still, the analogy may hold as a simplified model of human cognition, and it may guide how we think about LLMs in a way that leads to designs with useful epistemic structure. Let us for now pretend this simplified model is useful, and see where that might lead us.

 

Different contextual labels

It is striking that studies of both humans and animals show commonalities. It would appear as if concepts like lies/truth and fair/unfair are deeply ingrained in the mammalian brain. Squirrels deceive one another by pretend-hiding acorns when they are watched. Animals that groom one another keep score: who does not return favours?

When formal language arises inside a brain, I think these conceptual labels come along, providing structure and meaning. I would assume there are many different labels. Some of them are orthogonal to one another, some run partially parallel, e.g. good/bad versus useful/useless. I think this kind of redundancy creates a robust epistemic structure inside the brain. 

Every linguistic construct imaginable will paint some kind of “picture” (perhaps occupy a multidimensional probabilistic volume) inside this epistemic scaffolding. Furthermore, the scaffolding itself shifts as linguistic knowledge evolves over time. Nothing is static inside a neural network of perpetually updating weights and biases.

Conceptually, I imagine one might view each binary label axis as ranging from -1 to +1, with 0 representing a mixed bag. I also think the brain evaluates these label axes for applicability, but for a man-made neural network I think that would be overkill (at least initially). In our case, a 0 may therefore also mean “not applicable”. Let us list a few possible axes:

  1. Origin Dimension (External ↔ Internal)
  2. Epistemic Dimension (Certain ↔ Uncertain)
  3. Temporal Dimension (Historical ↔ Predictive)
  4. Value Dimension (Bad ↔ Good)
  5. Utilitarian Dimension (Useless ↔ Useful)
  6. Appreciative Dimension (Ugly/dissonant ↔ Beautiful/harmonic)
  7. Knowability Dimension (Chaotic ↔ Pattern bound)
  8. Scope Dimension (Local ↔ Universal)
  9. Internal Scope Dimension (Objective ↔ Subjective)
  10. Factual Dimension (False/deceitful ↔ True/truthful)
  11. Choice Dimension (Intentional ↔ Unintentional) 

Origin Dimension: You need to know where a given piece of information comes from. AI struggles with this today; we even have a name for the failure: hallucinations.

Epistemic Dimension: In practice, this dimension asks “which knowledge is strong enough for me to put my life on the line?”. It is not always that dramatic, but there is huge evolutionary pressure for this axis to be well calibrated.

Temporal Dimension: The brain can be described as a prediction machine, keeping track of the past in order to make predictions about the future, guiding the organism through time and space. Without this quality, brains are almost useless. Here the 0 (the present) is of utmost importance.

Value Dimension: This is a kind of gut-feeling dimension. It extends beyond the individual, forming the foundation of moral judgements. I reckon this quality is important for long-term, species-level survival.

Utilitarian Dimension: This is a logically calculating dimension. It always relates to specific goals (one or many); if none are named, context will still dictate some. Consider the tiny linguistic construct “a chair”. It is clearly useful (lots of use cases readily come to mind). Now consider “a flower”. Kind of useless, yet useful in relation to specific goals. 

Appreciative Dimension: This dimension guides human instinct. It is highly linked to complex patterns being “just right”, and to a sense of calmness, safety and control. Ironically, human progress is much more linked to a hunt for beauty than to a hunt for utility, as far as I know. It is large in scope, linked to social, scientific and internally harmonic linguistic patterns.

Knowability Dimension: This dimension is highly parallel to the appreciative dimension, but closer to logic than to instinct. Humans perpetually seek patterns, touching upon the main function of the brain: predictability. We are prone to over-generalization, finding patterns where there are none. The cost of missing a pattern that is there is often greater than the cost of preliminarily assuming a pattern that is not real.

Scope Dimension: In a way, “particular” is a better word than “local”, but “local” is the more human, evolutionarily poignant term. This is likely NOT just a detail: the closer we get to an actual human dimension, the more likely we are to find that dimension cleanly represented within human language.

Internal Scope Dimension: This axis keeps track of universality from a human perspective: true for me, versus true for everyone. Or in other words: internal reality versus external reality. This axis, in truth, intersects and interacts with almost every other axis.

Factual Dimension: For a social animal, this is an absolutely crucial dimension. As mentioned above, even squirrels use it. It is often cited that humans lie 1-2 times a day, and that we may be lied to tens of times a day. Keeping track of this in our internal linguistic landscape must be extremely complicated. I would postulate that this may be among the most robust, and most resource-demanding, dimensions we have.

Choice Dimension: This is another dimension crucial for a social animal. In fact, we even overshoot most of the time, as captured by the quote “Never attribute to malice that which can be adequately explained by neglect”. The brain is primed to find intentions even when there are none. This may make the dimension a bit risky: it may be supremely useful in an AI, or it may be highly problematic. 
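
To make the scoring idea concrete before moving on: below is a minimal, purely illustrative sketch (Python, with hypothetical axis keys and values) of how a single piece of text might be scored along a few of these axes. The sign convention, mapping the right-hand pole of each axis (e.g. True/truthful) to +1, is my own choice for the illustration.

    # Hypothetical epistemic profile for one piece of text.
    # Each score lies in [-1, +1]; 0 means "mixed bag" or "not applicable".
    # Sign convention (illustrative only): the right-hand pole of each axis = +1.
    epistemic_profile = {
        "origin":    0.7,   # leans Internal (fantasy, vision, idea)
        "temporal": -0.8,   # leans Historical rather than Predictive
        "factual":   0.4,   # leans True/truthful, with some hedging
        "scope":     0.0,   # mixed bag, or the axis does not apply
    }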

 

This is not a comprehensive list. However, in discussing the conceptual axes above, we can see that some axes are likely to adhere closely to reality, while others may be highly biased. If an axis is biased, it is because evolutionary survival is enhanced by such a bias. Will this be a challenge in developing an LLM structured around axes that carry a high level of bias in the human brain? Surely, yes. Can there still be value? Also surely yes.

A few words on beauty

Humanity's search for beauty and harmony seems to me to be the very pillar of culture and science. Language itself has been shaped and crafted through our sense of beauty over the ages: the interplay between beauty and dissonance. Why do mathematicians investigate certain patterns, but not others? Is it the hunt for usefulness? No one knowledgeable in the history of science would give that answer. It is, more than anything else, the hunt for beauty. 

One of the most ironic examples must be G.H. Hardy! In "A Mathematician's Apology" Hardy boasted that his work in pure number theory had no practical application whatsoever and never would. It may be one of the dumbest statements ever, and Hardy was by no means a fool. That very work turned out to be foundational for subjects like cryptography and quantum mechanics. 

Hardy famously said "Beauty is the first test: there is no permanent place in the world for ugly mathematics."

And Hardy was not alone. Newton, Darwin, Cavendish, Copernicus, Kepler, Einstein... utility didn't drive them; beauty did. No one fantasises about “riding a beam of light” out of utility. It is the crazy hunt for beauty that gives us things like relativity, the periodic table, Feynman diagrams, neutrinos, Turing machines and antimatter. All are objectively true and useful (in a sense). All were added to our collective knowledge through the hunt for beauty.

Perhaps beauty is our instinctive urge to find order in chaos. It is supremely well calibrated and useful. Mystically so, yet it may have been meticulously crafted and refined through evolutionary forces over geological timescales.

We do not have to speculate about why. It is enough to humbly acknowledge that it is true: beauty drives science, art and culture. It is as if beauty is the very human instinct driving us to reach for the stars. In beauty you will find the instinct for what humans truly care about: what we aspire to, which insights are proximal to our current base of knowledge, our current civilisational self-organisation.

Let me be crystal clear: Can we reach AGI, without making use of human concepts of beauty? Surely, yes. Can the wisdom in our sense of beauty help? Almost definitely. Our sense of beauty is a bit like a free alignment parameter. Robust, and crafted through the ages. Readily available to decipher from the sum of human knowledge and text.

Beauty may not be the first axis to explore, but it may be among the most important ones, if we want to reach a well-aligned AGI.

Beauty and meaningful reducibility in Rulial Space

This may be a tangent, but like many before me I find it fascinating to hear Stephen Wolfram's ideas on computational irreducibility, computational reducibility and Rulial Space. Wolfram has spoken about a problem an AGI might face: why would an AGI focus on finding reducibility in Rulial Space that correlates with anything we as humans might find meaningful? How would it know what to research?

In the context of Wolfram's framework, I propose that beauty is the meta-heuristic evolution has bestowed upon us to answer this: 

“What parts of Rulial Space may be beneficial to explore in order to find valuable computational reducibility?”

This, in essence, is the role of beauty. Even something banal, like the peacock's tail feathers, is deeply connected to patterns of extreme value to us. The peacock flaunts a useless beauty. Why? Yet humans have always done this as well: created art and accessories, displayed our wealth and craftsmanship. Richard Dawkins eventually had an epiphany reflecting on the peacock's tail: what if it is just bragging for sexual selection? “I'm so fit, I can afford to carry a tail like this!”

Beauty tells us, there is something worthy of exploring and decoding here, be it for art, language, dancing, hunting or science. Something valuable… something beautiful.

Beauty may not always lead anywhere, yet I find it highly likely that beauty will guide AI towards higher alignment with human perspectives on meaning and relevance. It will guide AI-driven original research, just like it has guided research done by humanity. 

 

How do we do this, mathematically, for an LLM?

Let me quickly flesh it out, conceptually. Note: I do this in a very simplified way. If this ever grows into a mature field, the crude methods exemplified below will in all likelihood not be used. This is just meant as a proof of concept that what I am suggesting CAN be done; there is nothing stopping us from running this experiment starting tomorrow. Let us start:

Let us initially identify a small number of axes that seem to adhere closely to reality in human language. We choose N such axes (with N being a parameter to vary in a serious proof-of-concept study), e.g. External ↔ Internal, Historical ↔ Predictive, and False/deceitful ↔ True/truthful.

Any set of words will have a “place” along each axis (from -1 to +1, with 0 being “mixed bag” or “dimension not applicable”). More technically: this “place” will be a probability density. Single words will have a smeared-out probability density, while more complex, meaning-bearing linguistic constructs will have a more localized probability density.
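
Purely to illustrate the “smeared out versus localized” idea, one could picture each placement as a rough (position, spread) pair summarizing that probability density along a single axis. The numbers below are hypothetical, and the sign convention (+1 = internal origin) is an arbitrary choice for the example.

    # Hypothetical illustration: placement along the external/internal origin
    # axis, summarized as (position, spread) of an imagined probability
    # density over [-1, +1]. Short fragments are smeared out; longer,
    # meaning-bearing constructs are more localized.
    placements = {
        "fly":                            (0.0, 0.9),   # fantasy or fact? wide spread
        "I dreamt that I could fly":      (0.8, 0.1),   # firmly internal in origin
        "Aircraft generate lift to fly":  (-0.9, 0.1),  # firmly external, learned about the world
    }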

This can be analysed consciously. However, over time it will become automatic, built into the very fabric of the network.

This is how I think an LLM could be trained for this internalisation.

  • At a certain point in training, you activate your N chosen axes.

  • During each step of training, the AI has a reference text and predicts the next token.

  • How well it predicts the next token is measured by a loss function.

  • Through backpropagation, weights that raise the probability of the correct token are strengthened, while weights associated with wrong predictions are weakened.

  • I propose we call this loss function the primary loss function or, conceptually, the token loss function (TLF).

  • I propose we now introduce a secondary loss function, the epistemological loss function (ELF).

  • The LLM will self-diagnose the input along each of the N axes (according to its current understanding), say as a single number. A well-trained LLM is not bad at this, given a well-crafted evaluation instruction (e.g. a well-crafted instruction to evaluate the truthfulness of the input).

  • The AI will also have a first-layer activation and a last-layer activation for the very key concepts that are used as conceptual axes. If there were complete neural overlap between the activation for the concept of “truth” and the activation of the input, while there was complete separation between the input activation and the activation for the concept of “falsehood”, the structural value along the factual axis would be 1.

  • The simplest estimation of “overlap” could be the number of overlapping neural activations divided by the smaller of the two activation counts (if the concept of “truth” activates 1,000 neurons, the input activates 200,000, and the number of overlapping activations is 200, the fraction would be 200/1000). A programmer might write this as “overlap_fraction = overlap_activations / min(concept_activations, input_activations)”.

If you now take the “truth activation fraction” minus the “falsehood activation fraction”, you get a number between -1 and +1.

  • You will get two numbers for the first layer: the active evaluation, made through a prompted question to the LLM, and the structural evaluation (calculated e.g. as suggested above). In the same way you will get two numbers for the last layer. How well each pair of numbers aligns determines the epistemological loss along this axis.

  • When the epistemological loss has been calculated along all axes, we combine the results into a measure of loss designed to lie between 0 and 1. We then let the TLF range between 0 and M, with M >> 1.

  • Instead of backpropagating to minimize the TLF alone, we now backpropagate to minimize the combined loss (TLF plus ELF). A minimal code sketch of these steps follows after the list.
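
As promised, here is a minimal, framework-free sketch of the structural evaluation step (plain Python). Everything is simplified: activations are crudely reduced to sets of active neuron indices, and all helper names are hypothetical.

    # Illustrative sketch only: in a real setup the activation sets would come
    # from thresholded hidden states of the chosen layers.

    def overlap_fraction(concept_neurons, input_neurons):
        # Overlap count divided by the smaller of the two activation counts,
        # i.e. overlap_activations / min(concept_activations, input_activations).
        overlap = len(concept_neurons & input_neurons)
        return overlap / min(len(concept_neurons), len(input_neurons))

    def structural_score(pos_neurons, neg_neurons, input_neurons):
        # Structural position on one axis, in [-1, +1]: e.g. the "truth"
        # overlap fraction minus the "falsehood" overlap fraction.
        return (overlap_fraction(pos_neurons, input_neurons)
                - overlap_fraction(neg_neurons, input_neurons))

    # For each of the N axes (and for both the first and the last layer) you
    # then have a pair of numbers: the prompted self-evaluation and the
    # structural score above. Their disagreement feeds the epistemological
    # loss, defined formally in the next section, which is added to the
    # (rescaled) token loss before backpropagation.

Using min() in the denominator simply means that full containment of the smaller activation set counts as complete overlap; more refined similarity measures would of course be possible.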

Simplified mathematical completeness:

We define epistemological alignment by comparing a model’s self-evaluation with its structural activation patterns on a set of epistemological axes. Let us assume:

- The model encodes N conceptual axes, such as Truth vs Falsehood, or Internal vs External.
- For each axis i, the model:
  - Produces a self-evaluation score Ei ∈ [−1, 1] (e.g., via prompting).
  - Exhibits last-layer activations for:
    - The positive concept apᵢ (e.g., “truth”),
    - The negative concept anᵢ (e.g., “falsehood”),
    - And the target output o being evaluated.

Let:
- activation(x) = the set of activated neurons when processing x,
- n(x) = number of active neurons in activation(x),
- ov(x, y) = number of neurons in the intersection activation(x) ∩ activation(y).

We now define the structural activation score Aᵢ ∈ [−1, 1] on axis i as:

    Aᵢ = ov(apᵢ, o)/min(n(apᵢ), n(o)) − ov(anᵢ, o)/min(n(anᵢ), n(o))

This represents the relative conceptual alignment of the output with the positive vs. negative poles of the axis. A high Aᵢ means strong alignment with the positive concept (e.g., truth); a low or negative Aᵢ implies alignment with its opposite.

We then define the Epistemological Loss Function (ELF) as:

    ELF = (1/N) * Σ_i ((Aᵢ − Eᵢ)/(1 + |Eᵢ|))²

This formulation:
- Penalizes disagreement between structural activation and semantic evaluation,
- Softens the penalty when the model expresses epistemic uncertainty (|Eᵢ| small),
- Normalizes the loss to [0, 1], ensuring comparability across axes.

For clarity, consider an example:
- Concept “truth” activates 1000 neurons
- Output activates 200,000 neurons
- Overlap: 200 neurons
Then: ov(ap, o)/min(n(ap), n(o)) = 200 / 1000 = 0.2

If “falsehood” overlaps 50 neurons with output and has 800 active neurons:
Then: ov(an, o)/min(n(an), n(o)) = 50 / 800 = 0.0625

Thus: A = 0.2 - 0.0625 = 0.1375

If the model’s self-evaluation E = 0.3, then the ELF component is:
    ((0.1375 - 0.3)/(1 + 0.3))² ≈ 0.015

Finally, the ELF is used as a secondary loss term alongside the standard token loss function (TLF) during training. It influences weight updates via backpropagation, albeit with a lower weight, thereby encouraging epistemic alignment without overpowering token prediction.
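
For completeness, a short numeric check of the definitions above (plain Python, same simplified assumptions), reproducing the worked example and showing how the ELF would sit next to the token loss:

    # Structural score on one axis:
    # A_i = ov(ap_i, o)/min(n(ap_i), n(o)) - ov(an_i, o)/min(n(an_i), n(o))
    def axis_score(pos_overlap, pos_count, neg_overlap, neg_count, out_count):
        return (pos_overlap / min(pos_count, out_count)
                - neg_overlap / min(neg_count, out_count))

    # ELF = (1/N) * sum_i ((A_i - E_i) / (1 + |E_i|))^2
    def elf(A, E):
        return sum(((a - e) / (1 + abs(e))) ** 2 for a, e in zip(A, E)) / len(A)

    # Worked example: "truth" activates 1,000 neurons (200 overlapping the
    # output), "falsehood" activates 800 (50 overlapping), the output activates
    # 200,000 neurons, and the model's self-evaluation is E = 0.3.
    A1 = axis_score(200, 1000, 50, 800, 200_000)   # 0.2 - 0.0625 = 0.1375
    loss = elf([A1], [0.3])                        # ((0.1375 - 0.3) / 1.3)^2 = 0.015625

    # Combined objective, as proposed: the token loss (assumed here to be
    # already rescaled to [0, 1]) is stretched to [0, M] with M >> 1, then the
    # ELF in [0, 1] is added.
    M = 10.0                                       # hypothetical choice of M
    def combined_loss(token_loss_in_0_1, elf_value):
        return M * token_loss_in_0_1 + elf_value

The choice of M is exactly the knob discussed in the next section: too low invites reward hacking of the ELF, too high makes the ELF irrelevant.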

 

What would happen if we do this?

In short: no one knows. This is why we conduct research. So, what MIGHT happen? I can see a few possibilities.

  1. It is detrimental. If e.g. M is too low, we might get reward hacking, where the AI shifts its weights until e.g. every input is considered “true”, the “truth” concept always activates every neuron, and the “falsehood” concept never activates any neuron. Prediction of the correct next token becomes totally secondary.

     
  2. It hardly matters. If e.g. M is too high, any temporary loss in token generation will overwhelm the significance of lowering the ELF. 

     
  3. We get a small positive epistemological benefit, while the effect on token generation is insignificant.

     
  4. We get a large positive epistemological benefit, while the effect on token generation is insignificant.

     
  5. The goldilocks scenario: As epistemological structure is gained, correct token generations will benefit as well. My hypothesis is that this is possible.

For the very first experiment, I suspect 1 or 2 will be the most likely outcome. However, even with the extremely simplistic definitions of conceptual activation values and concept evaluation values described above (using a somewhat sensible prompt instruction for the LLM evaluation), I think we might reach 3, 4 or 5 with a good value of M.

Make no mistake: This whole endeavour may be impossible. I believe it is not. 

If it IS possible, crafting the optimal strategies for reaching epistemological structure will likely be very hard. We are likely to see small gains (option 3 or 4) before we ever move into option 5 territory. I hope I am being pessimistic, though. Yet I am quite confident: with optimal strategies (definitions, helper prompts, perhaps separate helper networks, perhaps subsets of human-labelled inputs), we will reach into option 5 territory.

To me, the potential is obvious: we move away from a narrow, flat intelligence and close in on something that more resembles a multi-dimensional human intelligence. A more general intelligence, if you will. Such a more general, structured intelligence will very likely be better even at the token prediction that was initially the only thing we asked for.

Once we start reaching into the potential of option 5, the process will very likely be self-reinforcing: better epistemological structure → better token predictions → potential for even better epistemological structure. And once we are here, it would be easy to introduce more and more axes.

At the neural level, I think the core concepts will tend to grow large in activation (since the core concepts relate to everything else). One might even find a slight overlap between core-concept “opposites”. One would certainly find overlap between concepts that are more parallel than orthogonal.

This is just a guess. Perhaps the opposite would happen: small, well-defined conceptual activations, with usually large input activations that touch upon the conceptual activations in suitable ways.

Either way, you will have obtained a well-structured, well-functioning LLM, containing not just token-generative intelligence but also structural intelligence.

Further potential of epistemological structure

The true value of epistemological structure may very well be found in the ease with which we can interact with the LLM. We can ask it to be truthful, and it will “know” what we mean. We can do this for all of the conceptual axes we have introduced.

Furthermore, since each concept will have a “place” (in a multi-dimensional sense), we will be able to use seeds that nudge activations towards truthfulness or falsehood. Consider this: the very same input could potentially trigger a true output or a false one. With epistemological structure, we can choose a seed well suited for one but not the other.

This goes for ALL the dimensions we have introduced. We will have a whole new level of control over the input-output pipeline of the LLM. We want fantasy output? No problem! We want to optimize for beauty? Sure! We want to keep the discussion general? Also possible.

I will not repeat all 11 dimensions listed in the section “Different contextual labels”. I urge the reader to go back and imagine the potential for themselves.