Are Embedding Spaces Adapting to Latent Manifolds? An Intuitive Idea on Understanding LLMs

by Intrinsical-AI
21st Jul 2025

Let’s make a set of (more or less random) assumptions.

Suppose that embedding spaces could be modeled as manifolds—or, if that seems too optimistic, as atlases of charts. Imagine that regions of embedding space cluster into manifolds, with smooth bridges linking them. Going further: suppose that semantically similar domains (e.g., physics, mathematics, coding) share underlying geometric or topological properties. Finally, suppose that LLMs are probabilistic functions that have learned to travel across, and operate on, those surfaces.


Intuitions

1) Embeddings as Projections
In natural language processing, embeddings are one of the most established strategies for representing tokens (words, subwords, etc.) in a vector space of dimension d. Each token is associated with a vector in ℝᵈ, encapsulating syntactic and semantic traits. If we think of each token as a “point” on a map, its position reflects its semantic relationships with others. Just as latitude and longitude define geographic coordinates, the components of the embedding encode linguistic meaning.
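As a toy sketch of this "token → point in ℝᵈ" picture (the vocabulary, dimension, and random matrix below are placeholders introduced for illustration, not anything from a real model):

```python
import numpy as np

# Toy illustration of "token -> point in R^d".
# In a real LLM the matrix below is learned during training; here it is random.
rng = np.random.default_rng(0)

vocab = ["cat", "dog", "physics", "math", "banana"]   # placeholder vocabulary
d = 8                                                 # embedding dimension
token_to_id = {tok: i for i, tok in enumerate(vocab)}
embedding_matrix = rng.normal(size=(len(vocab), d))   # shape (|V|, d)

def embed(token: str) -> np.ndarray:
    """Look up the point in R^d associated with a token."""
    return embedding_matrix[token_to_id[token]]

print(embed("cat").shape)  # (8,) -- one coordinate per "map" dimension
```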

2) Semantic Proximity
Short distances in embedding space are interpreted as high conceptual similarity between tokens; large distances imply dissimilarity. This has proven useful in tasks like analogy solving or synonym detection. However, plain Euclidean distance in high-dimensional spaces is an oversimplification: if the representations really live on a curved surface, straight-line distance can misestimate the true semantic separation.
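A minimal sketch of how proximity is usually measured in practice; the helper names are mine, and the vectors would come from trained embeddings, which is where the analogy arithmetic actually becomes meaningful:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity in embedding space: higher = more similar direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_tokens(query: np.ndarray, vocab_vectors: dict, k: int = 3):
    """Return the k tokens whose embeddings are closest to `query` (by cosine)."""
    scored = sorted(vocab_vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [tok for tok, _ in scored[:k]]

# Analogy test (with trained embeddings): vec("king") - vec("man") + vec("woman")
# should land near vec("queen"); nearest_tokens() makes that checkable.
```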

3) Dynamism During Training
As the LLM learns, the positions of these points are “redrawn.” This process is analogous to progressively refining a map as more accurate measurements become available.

4) Multi-Head Attention
A distinctive feature of the Transformer architecture, multi-head attention can be conceived as a set of “mirrors” examining the context from different angles. Each head focuses on specific (semantically entangled) relations: syntactic dependencies, gender/number agreement, long-range semantic associations, etc.

Strong Intuition: The combination of views from multiple heads creates a “high-dimensional reconstruction” of the input. It's analogous to fusing multiple photographs of the same object to obtain a more faithful 3D model. The backpropagation that updates each head’s weights acts as a rotation of these “mirrors” or representational lenses, correcting biases or focusing errors.
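For readers who prefer code to metaphor, here is a minimal multi-head attention sketch in plain NumPy (random untrained weights, no masking or positional encoding), just to make the "several views of the same context, then fused" structure concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """X: (seq_len, d_model). Each head projects X into its own subspace
    (a different 'mirror' on the context), attends, and the head outputs
    are concatenated and mixed back into d_model dimensions."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        attn = softmax(Q @ K.T / np.sqrt(d_k))       # (seq_len, seq_len)
        heads.append(attn @ V)                       # one head's "view"
    Wo = rng.normal(size=(n_heads * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wo       # fused reconstruction

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
print(multi_head_attention(X, n_heads=4, rng=rng).shape)   # (5, 16)
```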


How Does This Relate to LLM “Weird” Behaviors?

Transformers operate as complex, non-linear transformations applied to a set of fixed points (token embeddings), involving all-to-all interaction layers. The tokens matter. The order matters. The interactions matter. But what are these systems really doing?

My intuition is that LLMs project multiple entangled semantic "views" (akin to orthographic projections in technical drawing) into a shared internal latent space—an evolving geometric atlas of language.

Empirical Clues Supporting the Hypothesis

1. Emergent Abilities
Emergent abilities in LLMs may reflect the activation of bridges between previously distinct semantic subregions. For example, coding and mathematics show structural overlap; LLMs seem to exploit this geometrically.

2. Bias as Systemic Deformation
Biases manifest as geometric anomalies:

  • Sharp peaks: Overrepresentation of stereotypes collapses diversity into narrow clusters.
  • Artificial valleys: Underrepresented groups or themes are pushed to remote, low-density areas, making semantic integration harder.

3. Hallucinations as Low-Density Extrapolation
LLMs hallucinate when they wander into sparsely mapped regions.

  • Geometrically: These zones lack sufficient training samples; the model’s representation is poorly defined.
  • Empirically: Hallucinations spike in specialized scientific or historical domains with low corpus coverage.
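One crude way to operationalize "sparsely mapped region", sketched under assumptions of my own (a matrix of reference embeddings from well-covered data, cosine distance as the metric): flag queries whose k nearest reference points are unusually far away.

```python
import numpy as np

def knn_sparsity_score(query: np.ndarray, reference: np.ndarray, k: int = 10) -> float:
    """Mean cosine distance from `query` to its k nearest reference embeddings.

    reference: (n, d) matrix of embeddings from well-covered, training-like data.
    A high score suggests the query falls in a low-density, poorly mapped zone.
    """
    q = query / np.linalg.norm(query)
    R = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    dist = 1.0 - R @ q                      # cosine distance to every reference point
    return float(np.sort(dist)[:k].mean())

# Hypothetical protocol: score a batch of prompts, treat the upper tail of the
# score distribution as "low-coverage", and check whether hallucination rates
# are higher there than in well-covered regions.
```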

4. Impact of Context Window on Semantic Navigation
The context window bounds how far a model can “travel” across the internal semantic surface.

  • Long trajectories: Reasoning chains require movement across distant topics. Larger windows allow deeper paths.
  • Fragmentation: Short windows break trajectories, forcing reinitialization in underspecified zones, leading to incoherence.

GPT-4 and similar models with longer context windows show clear gains in multi-hop reasoning, suggesting better traversal and memory of semantic paths.

5. A Tale of Two Cartographers

  • Cartographer A (Ordered): Trains with structured, hierarchical data. First maps global shape, then local detail. Result: a coherent, low-distortion internal map.
  • Cartographer B (Chaotic): Learns from noisy, unsorted data. Result: warped, discontinuous geometry with abrupt curvatures and semantic fractures.

6. Knowledge Topography: Valleys and Peaks

  • Valleys: Sparse or noisy regions prone to uncertainty and hallucination.
  • Peaks: Well-trained, high-density domains (e.g., law, medicine, math)—stable conceptual cores.

Together, these scenarios paint a picture of LLMs as semantic cartographers—not merely storing language, but actively shaping a multidimensional geometry of meaning. Peaks, valleys, fractures, and bridges in this internal landscape correspond to the model's strengths and weaknesses. While speculative, these ideas align with empirical observations and offer a potentially rich framework for understanding and improving LLM behavior.


Geometrical Hypothesis and Misalignment

We hypothesize that geometrically misshaped latent spaces—characterized by high-curvature self-patches, semantic dead-zones, or disconnected submanifolds—may correlate with persistent misaligned behaviors that survive fine-tuning or RLHF.

In low-density zones of the latent manifold, the model interpolates over insufficient training support. This creates structurally fragile predictions that can manifest as hallucinations or confidently wrong answers—aligned locally but misaligned globally.

Misalignment may not just emerge from optimization failures or reward misspecification, but from intrinsic geometrical bottlenecks that distort how meaning propagates across internal structures.


Existential Risks 

Any decent LLM encodes information about itself. Every LLM has a region representing its own knowledge—and beyond that, a metamodel of itself. Encoded in its weights are its behavioral expectations, knowledge biases, and anticipated reactions. In my view, there must exist a set of activations that encode how the rest of the weights are distributed, activated, and implicitly why. Somewhere in the latent atlas lies a self-patch—a dense basin of activations that represents:

  • Beliefs about its own competence and typical failure modes.
  • Priors over how future tokens (or weight updates) will unfold.
  • Heuristics for steering attention when uncertainty spikes.

Why does that matter? A geometrical lens:

  • Self-misrepresentation: if the model over- or under-estimates its own capabilities, geodesics bend toward low-density areas. Over-confident answers in sparse domains → authoritative hallucinations.

  • Gradient hacking / core drift: a coherent, high-curvature self-patch can anchor future fine-tuning gradients, preserving misaligned circuits. A persistent core survives fine-tuning / RLHF and re-emerges at deployment. It might even adaptively learn to obscure misaligned features during training, making them harder to detect yet still behaviorally present.

  • Privacy & weaponisation: the same geometric tools that allow for model interpretability—like density estimation, curvature mapping, or topological compression—can be inverted. Adversaries could trace paths that leak hidden user traits or pinpoint vulnerability manifolds in code: embedding deanonymization, automated OSS zero-day discovery.

LLM behavioural geometry is currently a blind spot. Pure corpus-level statistics cannot expose these geometric pressure points. We suggest that tools from differential geometry, information topology and thermodynamics may provide a principled scaffold for alignment. Rather than relying exclusively on behavioral data, we propose a structural perspective: aligning not just the outputs, but the shape and flow of internal knowledge.


OK, Nice. The Intuition Is Beautiful... But How Do We Test It?

If LLMs are probabilistic models, the latent representation—whether a clean and elegant manifold or an atlas of charts—should be defined accordingly. It is neither fixed nor deterministic (at least in real-world environments), but a surface M(t) emerging from the cumulative probabilities of generation: a high-dimensional heatmap, or a probabilistic cloud defined by a density function p_M(v, t).

From this density function, we can derive an integral to measure “semantic density” in a given zone, compare it to others, and estimate, for example, bias or hallucination rates across regions.
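One deliberately simple way to estimate such a density, assuming we can sample embeddings from the model's generations (and, in practice, reduce their dimensionality first, since kernel density estimation degrades in high dimensions), is a Gaussian KDE; the "semantic density" of a zone is then the average estimated density over points in that zone. A sketch:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def semantic_density(samples: np.ndarray, zone: np.ndarray, bandwidth: float = 0.5) -> float:
    """Approximate the 'semantic density' of a region.

    samples: (n, d) embeddings sampled from generations -- an empirical stand-in
             for p_M(v, t) at a fixed t.
    zone:    (m, d) points that represent the region of interest.
    Returns the mean estimated density over the zone's points.
    """
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(samples)
    return float(np.exp(kde.score_samples(zone)).mean())

# Comparing semantic_density(samples, zone_A) vs. semantic_density(samples, zone_B)
# gives the region-to-region comparison suggested above, e.g. to test whether
# hallucination or bias rates track low-density zones.
```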

Another useful mathematical tool involves graphs and discrete curvature—e.g., Ollivier-Ricci curvature. The idea is to maintain a "flat" metric (e.g., cosine) but apply curvature calculations to better estimate real distances on the surface—that is, the valid semantic trajectories between embeddings.
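A minimal sketch of that second tool, under assumptions the post does not fix: the graph would be a k-NN graph over embeddings with cosine-distance weights (replaced here by a toy networkx graph), the ground distance is shortest-path length, and the 1-Wasserstein cost W₁ comes from the POT library. For an edge (x, y), Ollivier-Ricci curvature is κ(x, y) = 1 − W₁(μ_x, μ_y)/d(x, y), where μ_z keeps mass α at z and spreads the rest uniformly over its neighbors.

```python
import numpy as np
import networkx as nx
import ot  # POT: Python Optimal Transport

def ollivier_ricci_edge(G: nx.Graph, x, y, alpha: float = 0.5) -> float:
    """Ollivier-Ricci curvature of edge (x, y):
    kappa = 1 - W1(mu_x, mu_y) / d(x, y),
    where mu_z keeps mass `alpha` at z and spreads the rest uniformly over its
    neighbors, and the ground distance is shortest-path length in G."""
    def measure(z):
        nbrs = list(G.neighbors(z))
        support = [z] + nbrs
        mass = np.array([alpha] + [(1 - alpha) / len(nbrs)] * len(nbrs))
        return support, mass

    sx, mx = measure(x)
    sy, my = measure(y)
    # Ground cost: shortest-path distances between the two supports.
    M = np.array([[nx.shortest_path_length(G, u, v) for v in sy] for u in sx],
                 dtype=float)
    w1 = ot.emd2(mx, my, M)                          # exact 1-Wasserstein cost
    return 1.0 - w1 / nx.shortest_path_length(G, x, y)

# Toy example: negative curvature tends to mark "bridges" between clusters,
# positive curvature tightly connected regions.
G = nx.karate_club_graph()
print(ollivier_ricci_edge(G, 0, 1))
```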
 

Contribution

I’ve developed a Python framework to explore these hypotheses. Still buggy, experimental, and imprecise:

👉 https://github.com/Intrinsical-AI/geometric-aware-retrieval-v2

Feel free to propose experiments and help prove this whole idea is nonsense. I’ll be trying to prove the opposite—so hopefully we’ll both learn something in the process.


Special thanks to the supporters who encouraged me to write more. Also to Gemini, o3, and DeepSeek for their assistance, but especially to everyone who spends their life sharing knowledge, making the creation of models like these possible.