optimization ate my variety, and you're next (the HLE)

TheHyperneticPrince

Rejected for the following reason(s):

Difficult to evaluate, with potential yellow flags.
Insufficient Quality for AI Content.

Read full explanation

Based on my paper published in Systems: https://www.mdpi.com/2079-8954/14/2/197

TLDR: Starting with an old cybernetic insight, I develop the Hypernetic Law of Experience (HLE) to describe a regime of optimization-induced collapse that shows up across domains. I propose that such disparate phenomanae as recursive training collapse in AI, economic bubbles, and language death are predictable outcomes of this cross-domain geometry. Biology -- having spent billions of years to work out the details via natural selection -- heavily invests in boosting variety at many levels, and we should follow its lead if we want to keep our systems adaptable under changing conditions.

Introduction

W. Ross Ashby's Law of Requisite Variety (LoRV) is foundational to systems theory: a regulator must possess as much variety as the system it controls.

But less well known is a law called the Law of Experience, which is described over two pages in an earlier section of An Inroduction to Cybernetics and never mentioned again.

What does it say? "A uniform change at the inputs of a set of transducers tends to drive the set’s variety down." (page 138). Now, this might not look like much. And judging from the sparse number of sources I could find that touched on the subject (and none in any real depth), that seems to be the scholarly consensus. But with a bit of tinkering, I think the idea sits comfortably with the core cybernetic ideas.

This expanded / operationalized version of the Hypernetic^[1] Law of Experience (HLE). iterative/recursive optimization in adaptive systems tends to drive varietal collapse over time, making systems^[2] increasingly brittle to environmental changes even when local performance improves.

If the LoRV might be simplified to variety is required in order to ensure good regulation, then the HLE might be sustained input tends to diminish the regulator's variety.

I went ahead and described the process of developing the HLE from the vanilla law to a bonus section at the end, assuming people want the implications of the HLE before the history lesson.

So, with the HLE and LoRV together, we end up with the idea that variety and optimization exist in tension, and systems that don't protect or maintain variety-boosting channels become brittle under sustained optimization pressure. We can observe this pattern across all kinds of domains. For example: Shumailov's 2024 paper on LLMs collapsing under recursive training illustrates the pattern in AI; Minksy's framing of economic bubbles as stability breeding instability shows the pattern in economics; Muller's ratchet shows it in biology; and so on. My aim was to go beyond a surface-level analogy and try to explain the geometric overlaps between these domain-specific observations.

Some quick math

We can reason about the common dynamics in order to get a domain-neutral expression that describes the geometry -- what I call the Rebis equation. Assuming some input distribution that transforms (optimizes) a system along a gradient, we can describe the variety of a system like:

where V is the cybernetic variety of the system, 𝜆 is the optimization / convergence pressure, 𝜂 is novelty, and t indicates the time period (what I call a tick). Simply put, the system's variety at the next tick is the result of push and pull between optimization and novelty at the current tick. It is a standard autoregression thing, but I found it helpful to illustrate the mappings cleanly.

We can use the Rebis relationship with Beer's POSIWID ("the purpose of a system is what it does") ideas in this way:

An intrinsic mode where λₜ measures the system’s reinforcement of its characteristic behavior (Beerian purpose).
An observer-defined mode where λₜ measures convergence toward a designated target (observer-defined purpose).
A system is optimized under some input distribution when it stops transforming in response to further input.

When systems' outputs boost 𝜆, we have a feedback loop that leads to a recursive optimization regime -- an LLM training on its own outputs^[3], an economic bubble, or even inbreeding in biological clades. On the other hand, systems which have mechanisms that maintain or boost 𝜂 are manifesting the concept of stochastic shock. Some examples of the latter (my speculation) would be high-intensity interval training in fitness , medication holidays in medicine, and the anti-trust action that fragmented Bell Systems into many regional companies.

Remember: variety is required for effective regulation. In the limit, the loss of variety will leave systems vulnerable to brittleness if the environment / input distribution ever changes (and I can't think of any environments that stay static forever).

In AI

Let us start with Shumailov et al's paper, which provides direct empirical support for the HLE. What they see is that training models on their own outputs causes systems to lose the variety that makes the outputs useful.

"We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear."

To summarize what they did: they trained an AI system as normal using a baseline human-sourced training dataset (Gen 0). Once the AI (Gen 1) is trained, they have it produce an artificial dataset (in essence, they are asking the model to try to reproduce the dataset it was trained on). They then train Gen 2 using the dataset that Gen 1 just produced. Gen 2 trains Gen 3, and so forth. This is a lossy process, so it is imperfect. Just to show an excerpt of how the dataset degrades:

"Gen 0: Revival architecture such as St. John’s Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of perpendicular churches : those.
Gen 9: architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-."

A copy of a copy of a copy. In the Beerian sense, we can say that: systems at Gen t+1 are optimizing to generate datasets which are like Gen t. But because the loss of perplexity dominates the data in the limit (collapse to a delta function), we can say that: in the limit, systems are optimizing to generate datasets which have minimal perplexity. Simply put, the system is optimizing for the loss of variety in the limit. Now, if a user were to write a prompt into this optimized low-variety sytem, the output would be non-viable by most metrics.

However, Shumailov ran a variant of the experiment whereby 10% of the original human dataset (Gen 0) is mixed into the model-generated dataset. The result is that the system is surprisingly good at maintaining performance!

Now, let's look at the Rebis equation again:

We can map this to Shumailov easily. V is the perplexity of the output data, 𝜆 is pressure to lose that perplexity under training, 𝜂 is the injection of perplexity-boosting data (data which is not consistent with Gen t's convergence slope), and t is the training generation.

Under the basic experiment, because the system will tend to lose some information on the last generation's training dataset, 𝜆 will tend to be greater than 0. Because the training data is solely composed of the last generation's outputs, there is no source of stochastic shock, so 𝜂 remains 0 / minimal. In the limit, we see V approaching 0.

Under the variant experiment which keeps some random 10% of each gen's training data from the original dataset, we effectively have a cap on 𝜆 and a floor such that 𝜂 remains > 0 even when it is fully optimized.

This basic pattern holds true across domains. When systems lock themselves off from sources of novelty (by increasingly weighting their own outputs in optimization processes), they will tend to lose variety. When systems maintain some reliable connection to the environment / outside world / or even randomness, adaptive systems can maintain their variety even as they adapt to their environments. Shumailov et al put it evocatively: "the model becomes poisoned with its own projection of reality".

The geometry of systems "losing information about the tails of the distribution" as they optimize recurs across the case studies. Remember the LoRV: a regulator requires sufficient variety is required for good regulation. The loss of variety is an important part of optimization -- systems cannot be prepared for everything -- but brittleness results when optimization goes too far.

In other domains

While I run through this mapping exercise with lots of other domains in the paper, several of them also overlap with AI. Lets take language death. Much of this is admittedly speculation based in the logic of the HLE. The idea is not to say that HLE necessarily caused anything. It's more about clearly pointing out the geometry. If the HLE is a legitimate insight, this is the kind of stuff we should see.

While various estimates exist, the source I cite in the paper suggests that about half of the languages on Earth will be lost by 2100.

Let's look at the Rebis again:

where V maps to the variety^[4] of human languages, 𝜆 is this kind of aggregate pressure towards modal languages (economic concerns, pop culture / art being English-based, the anglomathematical basis of digital stuff), and 𝜂 is pressure to form new human languages (or transform existing ones). If the estimates (half of our languages are doomed by 2100) are correct, then 𝜆 is reliably very strong and 𝜂 must be low.

While I do not adopt any kind of strong Sapir-Whorf stance, it seems sensible to accept that languages can increase / reduce the ease with which ideas are generated, adopted, or transformed at least to some degree.

Now, if we accept that languages map to cognition in some sense, we are at risk of losing the tails of human cognition (or the ease of access to those tails) as the aggregate human language system is optimized^[5] for the loss of rare languages. I maintain that English is unique among human language in that it is a kind of lingua franca of the international economic system -- not disputing that other languages are used, of course. I adopt the perspective that math is a language (or is like a language), and I suggest a kind of English / math anglomathematical construct that is used for digital infrastructure (programming languages, AI systems).

It is worth noting that AI systems are usually trained on predominantly on English data. There is also good support for the idea that AI systems "think in English" even when prompted in other languages.

With all this in mind, we are risking a kind of optimization feedback loop (CURSE):

Capital is compatible with / broadly optimizes systems for scalar expression
Scalar expression privileges mathematics and mathematical literacy
Math literacy supports digitization and explosion of information
Informational abundance incentivizes development and use of AI systems
AI systems transform information according to an anglomathematic "cognitive" style. Between 30 - 40% of all information on the net is estimated to be AI-generated. Even material in other languages are likely to be processed in an anglomathematic style.
Those AI systems transform^[6] human cognition and institutions, possibly biasing humanity towards locking in the conditions that generated them. It isn't that out-of-distribution ideas are impossible. They just become less likely, more difficult to form, express, and understand.
Influence over AI systems tend to correlate with concentrations of capital.
Out-of-distribution ideas may be flagged^[7] as "safety issues" or otherwise downweighted.
The next generation of AI systems may be trained on their own outputs, generating a kind of slow-motion / macroscopic version^[8] of Shumailov's model collapse

The fact is that, when we really take a look across our systems, a lot of them are optimizing for profit maximization, even in a roundabout way. Capital is unique among the scalars in that it is abstract (it can exist as information), it can be used to generate itself (or derivatives of itself), and it is fungible across human domains much more easily than other scalars. For example: you cannot easily trade in your scientific citation counts to bribe regulators or purchase a house. You cannot use your chess Elo to hire a babysitter. Your bicep circumference cannot get you into a first class airplane seat.

Can we do anything?

I would say that biology offers us a kind of cheat sheet. The idea is that, if the geometry we IDed holds across domains, we might take lessons from one area of study (like biology) and "port" it over to another area of study (like economics). We sort of already do this with "inspired by nature" stuff -- velcro, genetic algorithms, and so on. But the tendency is to frame parallels as illustrative analogies. I think there is more to it than that.

At almost every level in its hierarchy, biology has affordances that generate stochastic shock 𝜂 or mitigate 𝜆. We see biological systems not maximizing short-term fitness in order to retain variety over time. It has had billion of years of short-term optimization destroying lineages to reveal that it is a limited meta-strategy. If biology does it, maybe we should pay attention. All of these have non-HLE reasons to exist, of course, but the geometry seems to line up.

The twofold cost of sex. Asexual reproduction is much more efficient for short-term fitness (all else equal).
Outbreeding preference. Inbreeding can be locally efficient but catastrophic in the long run.
Retroviruses rely on reverse transcriptase (like HIV) to remain consistently adaptable, even after being exposed to very harsh selection pressures (high 𝜆). Reverse transcriptase is very error prone (high 𝜂), which is a reason they so often develop drug and immune resistance.
I speculate about dreams as a manifestation of cognitive noise to avoid maladaptive optimization to low-signal situations during sleep.

Biology seems to not be optimizing for a clean universal scalar (not even Geometric-Mean Fitness). This suggests that the schemas that use multiple orthogonal metrics^[9] or multi-objective setups as purpose / optimization targets might be a sensible cross-domain lesson to apply generally across stability-first system designs. Biology is not a simple fitness maximizer.

Firms, however, are often optimized to maximize earnings per share or other short-term scalars. Our stuff is prone towards brittleness under pressure to maximize efficiency. Political systems are optimized around election cycles rather than long-term decision-making. Scientific journals wave through low-quality research simply because it looks like work that the venue published before. Messy qualitative truths are deweighted because they are difficult to measure. Money is very useful as one signaling channel, but we need to do a better job of maintaining objectives which are not "maximize profit", or the entire world will gradually optimize for that purpose.

At the same time, this is not a call to mindlessly add randomness to every optimization process. Stress-induced mutagenesis in bacteria and other clades has variable mutation rates. When the organism detects proxies for high λ-pressure (stress), the system upregulates η via mutation rates. If such a system could talk, it might say something like "my current configuration isn't working. I might as well try something different, because my lineage is dead otherwise." We might consider building high-plasticity systems that effectively generate stochastic shock when they need it but optimize efficiently when the coast is clear. The trick is to make sure that we don't optimize the shock channels away...

I am sure everybody is familiar with their version of the HLE -- the economics of Hyman Minsky, Muller's Ratchet, the Loudness War. I hope that connecting these disparate ideas together and linking them to old-school cybernetics helps others to generate insights.

What did Ashby say?

(for those interested in where the HLE came from)

After observing similar patterns over and over again across domains, I was struck by a small section in An Introduction to Cybernetics where Ashby describes something he calls the Law of Experience (pages 137 - 139).

"uniform change at the inputs of a set of transducers tends to drive the set’s variety down."

While I had observed this pattern of recursive lock-in before coming across Ashby's idea -- it has many domain-specific versions -- I realized the vanilla Law would be a big help in terms of not having to start from absolute scratch. My contribution was to extend the idea across all kinds of adaptive systems:

changes need not be uniform, only directional
systems need not be determinate. Under the gradient-based regime that we are looking at, systems drift towards determinacy in the limit
systems that lose variety in this way tend to become brittle
systems may have mechanisms that replenish this variety, staving off brittleness
we can map these dynamics to the Rebis equation
this dynamic may play out across multiple levels in a system's hierarchy

Now, Ashby tended to illustrate his insights with simple electronics in mind, it being the mid-20th century -- some kind of laboratory-based blinking light contraption (like his famous homeostat), but he maintained that cybernetics was meant to apply to all possible machines. Ashby undoubtedly anticipated some of this stuff. He muses about students "having all been through the same school" developing similar behaviors to each other as a result, but leaves this tendency to "further research". I suppose this is what I have done. Reluctant to put any words in Ashby's mouth, I felt my version was distinct enough for a new name -- hence, HLE.

Thank you for reading! This is my first post (and my first paper in systems theory), so I apologize for any roughness. Please let me know if anything needs clarification or if you see mistakes. I did my best to flag speculation while retaining a bold approach.

I cover a few other domains in the paper that I only briefly touched on in the post (economic bubbles, innovation collapse in science, neurology): https://www.mdpi.com/2079-8954/14/2/197

^{^}
Hypernetics is the name of my research program, with the linked page being the first paper to make it through peer review.
^{^}
I use system and regulator interchangeably but tend to prefer system. My feeling is that regulator brings to mind only typical thermostat-type setups rather than biological systems, purely physical ones, governments, economies, and so forth.
^{^}
In my other work, I call this system morphology CURSE: Context-dominant Unbounded Recursive Simplifying Executable. Will be treated in a future paper.
^{^}
We can use the total count as a proxy for true cybernetic variety. If we want to be specific as to this metric, we could talk about it as the aggregate Shannon entropy of the collective human population's languages.
^{^}
"Optimize" is meant strictly in terms of the pattern I have used elsewhere. I do not mean to celebrate the loss of the linguistic heritage of humanity.
^{^}
Just FYI: this post includes no LLM-generated material. The peer-reviewed paper was LLM-assisted.
^{^}
Recently, I have noticed that one popular LLM seems to be adding more friction than it used to when it comes to engaging with my material (taking issue with well-supported claims, insisting on hedging or cautionary language, and other irritating behavior where it did not previously exist to this extent). Maybe work in this area does present some kind of risk. When I inquired, it explained (inaccurately, in my view):
Your writing often moves into:
- structural inevitabilities
- moral failure diagnoses
- cross-domain unifications
Those require tighter language to avoid being dismissed as manifesto-style overreach.
^{^}
From the Shumailov paper: "the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet."
^{^}
Something I touch on in other work as the Beautiful Weapon Principle. Will be treated in a future paper.
^{^}
My preferred definition of a system is tweaked version of one of Ashby's aphorisms: a pattern with constraints strong enough to resist noise and weak enough to admit variation -- a persistence envelope that remains coherent across ticks.