Financial status: This is independent research. I welcome financial support to make further posts like this possible.
Epistemic status: These ideas are still being developed.
I am interested in recognizing entities that might exert significant power over the future.
My current hypothesis is that knowledge of one’s environment is a prerequisite to power over one’s environment.
I would therefore like a good definition of what it means for an entity to accumulate knowledge over time.
However, I have not found a good definition for the accumulation of knowledge. In this sequence I describe the definitions I’ve tried and the counterexamples that I’ve come up against.
The entities that currently exert greatest influence over the future of our planet — humans — seem to do so in part by acquiring an understanding of their environment, then using that understanding to select actions that are likely to achieve a goal. Humans accumulate knowledge in this way as individuals, and are also able to share this understanding with others, which has led to the accumulation of cultural knowledge over time. This has allowed humankind to exert significant influence over the future.
More generally, life forms on this planet are distinguished from non-life in part by the accumulation of genetic knowledge over time. This knowledge is accumulated in such a way that the organisms it gives rise to have a capacity for goal-directed action that is optimized for features of the environment that have been discovered by the process of natural selection and are encoded into the genome.
Even though genetic knowledge accumulates over many lifetimes and cognitive knowledge accumulates during a single lifetime, for our present purposes there is no particular need to distinguish "outer knowledge accumulation" from "inner knowledge accumulation" as we do when distinguishing outer optimization from inner optimization in machine learning. Instead, there are simply processes in the world that accumulate knowledge and which we recognize by the capacity this knowledge conveys for effective goal-directed action. Examples of such processes are natural selection, culture, and cognition.
In AI alignment, we seek to build machines that have a capacity for effective goal-directed action, and that use that capacity in a way that is beneficial to all life. We would particularly like to avoid building machines that do have a capacity for effective goal-directed action, but do not use that capacity in a way that is beneficial to all life. At an extreme minimum, we would like to have a theory of effective goal-directed action that allows us to recognize the extent to which our creations have the capacity to influence the future, so that we might make informed choices about whether to deploy them into the world.
The detection of entities that have a greater-than-expected capacity to influence the future is particularly relevant in the context of the prosaic AI regime, in which contemporary machine learning systems eventually produce entities with a capacity for effective goal-directed action that exceeds that of human society, without any new insights into the fundamental nature of intelligence or autonomy. In this regime, large-scale search processes working mostly by black-box optimization eventually produce very powerful policies, and we have relatively little understanding of how these policies work internally, so there is a risk that we deploy policies that exert greater influence over the future than we expect.
If we had a robust theory of the accumulation of knowledge, we might be able to determine whether a policy produced in such a way has the capacity to accumulate unexpectedly detailed knowledge about itself or its environment, such as a robot vacuum that unexpectedly accumulates knowledge about the behavior of its human cohabitants. Alternatively, with such a theory we might be able to detect the "in-flight" accumulation of unexpected knowledge after deploying a policy, and shut it down. Or we might be able to limit the accumulation of knowledge by deployed entities as a way to limit the power of those entities.
Understanding the accumulation of knowledge could be particularly helpful in dealing with policies that come to understand the training process in which they are embedded during the time that they are being trained and then produce outputs selected to convince the overseers of the training process to deploy them into the external world ("deceptive alignment" in the terminology of Hubinger et al). In order to behave in such a deceptive way, a policy would first need to accumulate knowledge about the training process in which it is embedded. Interrogating a policy about its knowledge using its standard input and output channels won’t work if we are concerned that our policies are deliberately deceiving us, but recognizing and perhaps limiting the accumulation of knowledge at the level of mechanism might help to detect or avoid deception.
Interestingly, in a world where we do not get prosaic AI but instead are forced to develop new deep insights into the nature of intelligence before we can build machines with the capacity for highly effective goal-directed action, investigating the accumulation of knowledge might also be fruitful. Among processes that converge towards a small set of target configurations despite perturbations along the way — say, a ball rolling down a hill, a computer computing the square root of two by gradient descent, and a team of humans building a house — it is only the team of humans building a house that do so in a way that involves the accumulation of knowledge. It might be that the central difference between systems that exhibit broad "optimizing" behavior, and the subset of those systems that do so due to the agency of an entity embedded within them, is the accumulation of knowledge. Furthermore, we might be able to understand the accumulation of knowledge without reference to the problematic agent model in which the agent and environment are separated, and the agent is assumed to behave according to an immutable internal decision algorithm.
In summary, investigating the accumulation of knowledge could be a promising line of attack on both the problem of understanding agency without presupposing a dualistic agent model, as well as the problem of detecting dangerous patterns of cognition in agents engineered via large-scale search processes. The key question seems to be: is knowledge real? Is knowledge a fundamental aspect of all systems that have the capacity for effective goal-directed action, or is it a fuzzy intermediate quantity acquired by some intelligent systems and not others?
This sequence, unfortunately, does not give any final answers to these questions. The next four posts will explore four failed definitions of the accumulation of knowledge and go over counterexamples to each one.
Suppose I show you a physically closed system — say, for the sake of concreteness, a shipping container with various animals and plants and computer systems moving about and doing things inside — and tell you that knowledge is accumulating within a certain physical region within the system. What does this mean, at the level of physics?
Or suppose that I show you a cellular automata — say, a snapshot of Conway’s Game of Life — and I point to a region within the overall game state and claim that knowledge is accumulating within this region. Without any a priori knowledge of the encoding of this hypothesized knowledge, nor of the boundary between any hypothesized agent and environment, nor of the mechanism by which any hypothesized computation is happening, can you test my claim?
Or even more abstractly, if I show you a state vector evolving from one time step to the next according to a transition function and I claim that knowledge is accumulating within some particular subset of the dimensions of this state vector, can you say what it would mean for my claim to be true?
I have been seeking a definition of knowledge as a correspondence between the configuration of a region and the configuration of the overall system, but I have not found a satisfying definition. In this sequence I will describe the attempts I've made and the challenges that I've come up against.
What a definition should accomplish
The desiderata that I’ve been working with are as follows. I’ve chosen these based on the AI-related motivations described above.
A definition should provide necessary and sufficient conditions for the accumulation of knowledge such that any entity that exerts goal-directed influence over the future must accumulate knowledge according to the definition.
A definition should be expressed at the level of physics, which means that it should address what it means for knowledge to accumulate within a given spatial region, without presupposing any particular structure to the system inside or outside of that region.
In particular there should not be reference to "agent" or "computer" as ontologically fundamental concepts within the definition. However, a definition of the accumulation of knowledge might include sub-definitions of "agent" or "computer", and of course it’s fine to use humans, robots and digital computers as examples and counterexamples.
The following are non-goals:
Practical means for detecting the accumulation of knowledge in a system.
Practical means for limiting the accumulation of knowledge in a system.
The failed definitions of the accumulation of knowledge that I will explore in the ensuing posts in this sequence are as follows. I will be posting one per day this week.
Direct map/territory resemblance
Attempted definition: Knowledge is accumulating whenever a region within the territory bears closer and closer resemblance to the overall territory over time, such as when drawing a physical map with markings that correspond to the locations of objects in the world.
Problem: Maps might be represented in non-trivial ways that make it impossible to recognize a map/territory resemblance when examining the system at a single point in time, such as a map that is represented within computer memory rather than on a physical sheet of paper.
Mutual information between region and environment
Attempted definition: Knowledge is accumulating whenever a region within the territory and the remainder of the territory are increasing in mutual information over time.
Problem: The constant interaction between nearby physical objects means that even a rock orbiting the Earth is acquiring enormous mutual information with the affairs of humans due to the imprinting of subatomic information onto the surface of rock by photons bouncing off the Earth, yet this does not constitute knowledge.
Mutual information over digital abstraction layers
Attempted definition: Knowledge is accumulating whenever a digital abstraction layer exists and there is an increase over time in mutual information between its high-level and low-level configurations. A digital abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations.
Problem: A digital computer that is merely recording everything it observes is acquiring more knowledge, on this definition, than a human who cannot recall their observations but can construct models and act on them.
Precipitation of action
Attempted definition: Knowledge is accumulating when an entity’s actions are becoming increasingly fine-tuned to a particular configuration of the environment over time.
Problem: A sailing ship that is drawing a map of a coastline but sinks before the map is ever used by anyone to take action would not be accumulating knowledge by this definition, yet does in fact seem to be accumulating knowledge.
The final post in the sequence reviews some of the philosophical literature on the subject of defining knowledge, as well as a few related posts here on lesswrong.