Financial status: This is independent research. I welcome financial support to make further posts like this possible.
Epistemic status: These ideas are still being developed.
I am interested in recognizing entities that might exert significant power over the future.
My current hypothesis is that knowledge of one’s environment is a prerequisite to power over one’s environment.
I would therefore like a good definition of what it means for an entity to accumulate knowledge over time.
However, I have not found a good definition for the accumulation of knowledge. In this sequence I describe the definitions I’ve tried and the counterexamples that I’ve come up against.
The entities that currently exert greatest influence over the future of our planet — humans — seem to do so in part by acquiring an understanding of their environment, then using that understanding to select actions that are likely to achieve a goal. Humans accumulate knowledge in this way as individuals, and are also able to share this understanding with others, which has led to the accumulation of cultural knowledge over time. This has allowed humankind to exert significant influence over the future.
More generally, life forms on this planet are distinguished from non-life in part by the accumulation of genetic knowledge over time. This knowledge is accumulated in such a way that the organisms it gives rise to have a capacity for goal-directed action that is optimized for features of the environment that have been discovered by the process of natural selection and are encoded into the genome.
Even though genetic knowledge accumulates over many lifetimes and cognitive knowledge accumulates during a single lifetime, for our present purposes there is no particular need to distinguish "outer knowledge accumulation" from "inner knowledge accumulation" as we do when distinguishing outer optimization from inner optimization in machine learning. Instead, there are simply processes in the world that accumulate knowledge and which we recognize by the capacity this knowledge conveys for effective goal-directed action. Examples of such processes are natural selection, culture, and cognition.
In AI alignment, we seek to build machines that have a capacity for effective goal-directed action, and that use that capacity in a way that is beneficial to all life. We would particularly like to avoid building machines that do have a capacity for effective goal-directed action, but do not use that capacity in a way that is beneficial to all life. At an extreme minimum, we would like to have a theory of effective goal-directed action that allows us to recognize the extent to which our creations have the capacity to influence the future, so that we might make informed choices about whether to deploy them into the world.
The detection of entities that have a greater-than-expected capacity to influence the future is particularly relevant in the context of the prosaic AI regime, in which contemporary machine learning systems eventually produce entities with a capacity for effective goal-directed action that exceeds that of human society, without any new insights into the fundamental nature of intelligence or autonomy. In this regime, large-scale search processes working mostly by black-box optimization eventually produce very powerful policies, and we have relatively little understanding of how these policies work internally, so there is a risk that we deploy policies that exert greater influence over the future than we expect.
If we had a robust theory of the accumulation of knowledge, we might be able to determine whether a policy produced in such a way has the capacity to accumulate unexpectedly detailed knowledge about itself or its environment, such as a robot vacuum that unexpectedly accumulates knowledge about the behavior of its human cohabitants. Alternatively, with such a theory we might be able to detect the "in-flight" accumulation of unexpected knowledge after deploying a policy, and shut it down. Or we might be able to limit the accumulation of knowledge by deployed entities as a way to limit the power of those entities.
Understanding the accumulation of knowledge could be particularly helpful in dealing with policies that come to understand the training process in which they are embedded during the time that they are being trained and then produce outputs selected to convince the overseers of the training process to deploy them into the external world ("deceptive alignment" in the terminology of Hubinger et al.). In order to behave in such a deceptive way, a policy would first need to accumulate knowledge about the training process in which it is embedded. Interrogating a policy about its knowledge using its standard input and output channels won’t work if we are concerned that our policies are deliberately deceiving us, but recognizing and perhaps limiting the accumulation of knowledge at the level of mechanism might help to detect or avoid deception.
Interestingly, in a world where we do not get prosaic AI but instead are forced to develop new deep insights into the nature of intelligence before we can build machines with the capacity for highly effective goal-directed action, investigating the accumulation of knowledge might also be fruitful. Among processes that converge towards a small set of target configurations despite perturbations along the way — say, a ball rolling down a hill, a computer computing the square root of two by gradient descent, and a team of humans building a house — it is only the team of humans building a house that do so in a way that involves the accumulation of knowledge. It might be that the central difference between systems that exhibit broad "optimizing" behavior, and the subset of those systems that do so due to the agency of an entity embedded within them, is the accumulation of knowledge. Furthermore, we might be able to understand the accumulation of knowledge without reference to the problematic agent model in which the agent and environment are separated, and the agent is assumed to behave according to an immutable internal decision algorithm.
In summary, investigating the accumulation of knowledge could be a promising line of attack on both the problem of understanding agency without presupposing a dualistic agent model and the problem of detecting dangerous patterns of cognition in agents engineered via large-scale search processes. The key question seems to be: is knowledge real? Is knowledge a fundamental aspect of all systems that have the capacity for effective goal-directed action, or is it a fuzzy intermediate quantity acquired by some intelligent systems and not others?
This sequence, unfortunately, does not give any final answers to these questions. The next four posts will explore four failed definitions of the accumulation of knowledge and go over counterexamples to each one.
Suppose I show you a physically closed system — say, for the sake of concreteness, a shipping container with various animals and plants and computer systems moving about and doing things inside — and tell you that knowledge is accumulating within a certain physical region within the system. What does this mean, at the level of physics?
Or suppose that I show you a cellular automaton (say, a snapshot of Conway’s Game of Life) and I point to a region within the overall game state and claim that knowledge is accumulating within this region. Without any a priori knowledge of the encoding of this hypothesized knowledge, nor of the boundary between any hypothesized agent and environment, nor of the mechanism by which any hypothesized computation is happening, can you test my claim?
Or even more abstractly, if I show you a state vector evolving from one time step to the next according to a transition function and I claim that knowledge is accumulating within some particular subset of the dimensions of this state vector, can you say what it would mean for my claim to be true?
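To make the setup in the last two questions concrete, here is a minimal sketch of a state, a transition function, and a region within the state, using Conway's Game of Life. The names `life_step`, `restrict`, `glider`, and `region` are illustrative choices, not anything from the post, and the sketch only sets up the objects the question is about. It does not answer whether knowledge is accumulating in the region.

```python
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]
State = FrozenSet[Cell]  # the set of live cells on an unbounded grid


def life_step(state: State) -> State:
    """Transition function: one step of Conway's Game of Life."""
    counts = {}
    for (r, c) in state:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0):
                    key = (r + dr, c + dc)
                    counts[key] = counts.get(key, 0) + 1
    # A cell is live next step if it has 3 live neighbours,
    # or if it is live now and has exactly 2 live neighbours.
    return frozenset(
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in state)
    )


def restrict(state: State, region: FrozenSet[Cell]) -> State:
    """The part of the overall state that lies inside the chosen region."""
    return state & region


# Hypothetical example: a glider, and a fixed 3x3 region of the grid.
glider: State = frozenset({(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)})
region = frozenset((r, c) for r in range(3) for c in range(3))

trajectory = [glider]
for _ in range(4):
    trajectory.append(life_step(trajectory[-1]))

# The history of the region is all that a region-based definition gets to see.
region_history = [restrict(s, region) for s in trajectory]
```

Any candidate definition would have to be a function of `trajectory` and `region` alone, with no further hints about encodings or agent boundaries.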
I have been seeking a definition of knowledge as a correspondence between the configuration of a region and the configuration of the overall system, but I have not found a satisfying definition. In this sequence I will describe the attempts I've made and the challenges that I've come up against.
What a definition should accomplish
The desiderata that I’ve been working with are as follows. I’ve chosen these based on the AI-related motivations described above.
A definition should provide necessary and sufficient conditions for the accumulation of knowledge such that any entity that exerts goal-directed influence over the future must accumulate knowledge according to the definition.
A definition should be expressed at the level of physics, which means that it should address what it means for knowledge to accumulate within a given spatial region, without presupposing any particular structure to the system inside or outside of that region.
In particular there should not be reference to "agent" or "computer" as ontologically fundamental concepts within the definition. However, a definition of the accumulation of knowledge might include sub-definitions of "agent" or "computer", and of course it’s fine to use humans, robots and digital computers as examples and counterexamples.
The following are non-goals:
Practical means for detecting the accumulation of knowledge in a system.
Practical means for limiting the accumulation of knowledge in a system.
The failed definitions of the accumulation of knowledge that I will explore in the ensuing posts in this sequence are as follows. I will be posting one per day this week.
Direct map/territory resemblance
Attempted definition: Knowledge is accumulating whenever a region within the territory bears closer and closer resemblance to the overall territory over time, such as when drawing a physical map with markings that correspond to the locations of objects in the world.
Problem: Maps might be represented in non-trivial ways that make it impossible to recognize a map/territory resemblance when examining the system at a single point in time, such as a map that is represented within computer memory rather than on a physical sheet of paper.
Mutual information between region and environment
Attempted definition: Knowledge is accumulating whenever a region within the territory and the remainder of the territory are increasing in mutual information over time.
Problem: The constant interaction between nearby physical objects means that even a rock orbiting the Earth is acquiring enormous mutual information with the affairs of humans due to the imprinting of subatomic information onto the surface of rock by photons bouncing off the Earth, yet this does not constitute knowledge.
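As a rough illustration of the quantity this definition relies on, the sketch below estimates Shannon mutual information from paired samples of a region's configuration and the environment's configuration. The function name and the toy data are assumptions made for illustration. The plug-in estimator used here is a standard textbook estimate, not something specified in the post.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def mutual_information(pairs: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in estimate of I(region; environment) in bits from observed pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    region = Counter(r for r, _ in pairs)
    env = Counter(e for _, e in pairs)
    mi = 0.0
    for (r, e), count in joint.items():
        p_joint = count / n
        # p(r, e) / (p(r) * p(e)) written with raw counts
        ratio = p_joint * n * n / (region[r] * env[e])
        mi += p_joint * math.log2(ratio)
    return mi


# Toy data (hypothetical): a region that mirrors one environmental bit,
# and a region that is statistically independent of it.
mirroring = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

mi_mirroring = mutual_information(mirroring)
mi_independent = mutual_information(independent)
```

The measure is blind to how the correlation arose or whether anything could use it, which is why the rock counts as highly "informed" under this definition.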
Mutual information over digital abstraction layers
Attempted definition: Knowledge is accumulating whenever a digital abstraction layer exists and there is an increase over time in mutual information between its high-level and low-level configurations. A digital abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations.
Problem: A digital computer that is merely recording everything it observes is acquiring more knowledge, on this definition, than a human who cannot recall their observations but can construct models and act on them.
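The notion of a digital abstraction layer can be sketched for a small finite system as follows. This is only one reading of the definition above, assuming deterministic low-level dynamics. The names `is_abstraction_layer`, `step`, and `coarse` are hypothetical. The check covers only the "predictable high-level transitions" condition, not the mutual-information part of the definition.

```python
from typing import Callable, Dict, Hashable, Iterable


def is_abstraction_layer(
    low_states: Iterable[Hashable],
    step: Callable[[Hashable], Hashable],
    coarse: Callable[[Hashable], Hashable],
) -> bool:
    """True if the next high-level state depends only on the current high-level state."""
    successor: Dict[Hashable, Hashable] = {}
    for s in low_states:
        high_now, high_next = coarse(s), coarse(step(s))
        # Every low-level state in the same group must lead to the same group.
        if successor.setdefault(high_now, high_next) != high_next:
            return False
    return True


# Hypothetical toy system: six low-level states arranged on a cycle.
low_states = range(6)


def step(s: int) -> int:
    return (s + 1) % 6


# Grouping by parity: the high-level state always flips, so it is predictable.
parity_ok = is_abstraction_layer(low_states, step, lambda s: s % 2)
# Grouping into halves: states 0 and 2 share a group but lead to different groups.
halves_ok = is_abstraction_layer(low_states, step, lambda s: s // 3)
```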
Precipitation of action
Attempted definition: Knowledge is accumulating when an entity’s actions are becoming increasingly fine-tuned to a particular configuration of the environment over time.
Problem: A sailing ship that is drawing a map of a coastline but sinks before the map is ever used by anyone to take action would not be accumulating knowledge by this definition, yet does in fact seem to be accumulating knowledge.
The final post in the sequence reviews some of the philosophical literature on the subject of defining knowledge, as well as a few related posts here on LessWrong.
Planned summary for the Alignment Newsletter:
This was a great summary, thx.
Your summaries are excellent Rohin. This looks good to me.
I think that part of the problem is that talking about knowledge requires adopting an interpretative frame. We can only really say whether a collection of particles represents some particular knowledge from within such a frame, although it would be possible to determine the frame of minimum complexity that interprets a system as representing certain facts. In practice, though, whether or not a particular piece of storage contains knowledge will depend on the interpretative frames in the environment, although we need to remember that interpretative frames can emulate other interpretative frames, e.g. a human experimenting with multiple codes in order to decode a message.
Regarding the topic of partial knowledge, it seems that the importance of various facts will vary wildly from context to context and also depending on the goal. I'm somewhat skeptical that goal independent knowledge will have a nice definition.
Well yes I agree that knowledge exists with respect to a goal, but is there really no objective difference between an alien artifact inscribed with deep facts about the structure of the universe and set up in such a way that it can be decoded by any intelligent species that might find it, and an ordinary chunk of rock arriving from outer space?
Well, taking the simpler case of exactly reproducing a certain string, you could find the simplest program that produces the string, similar to Kolmogorov complexity, and use that as a measure of complexity.
A slightly more useful way of modelling things may be to have a bunch of different strings with different points representing levels of importance. And perhaps we produce a metric combining the Kolmogorov complexity of a decoder with the sum of the points produced, where points are obtained by concatenating the desired strings with a predefined separator. For example, we might find the quotient.
One immediate issue with this is that some of the strings may contain overlapping information. And we'd still have to produce a metric to assign importances to the strings. Perhaps a simpler case would be where the strings represent patterns in a stream via encoding a Turing machine, with the Turing machines being able to output sets of symbols instead of just symbols, representing the possible symbols at each location. And the amount of points they provide would be equal to how much of the stream it allows you to predict. (This would still require producing a representation of the universe where the amount of the stream predicted would be roughly equivalent to how useful the predictions are).
Any thoughts on this general approach?
Well here is a thought: a random string would have high Kolmogorov complexity, as would a string describing the most fundamental laws of physics. What are the characteristics of the latter that convey power over one's environment to an agent that receives it, that is not conveyed by the former? This is the core question I'm most interested in at the moment.
I think grappling with this problem is important because it leads you directly to understanding that what you are talking about is part of your agent-like model of systems, and how this model should be applied depends both on the broader context and your own perspective.
Doesn't seem too problematic. It was acquiring information. If you are acquiring info, and then you die, then yes, knowledge may (and probably will) be lost.
Well if I learn that my robot vacuum is unexpectedly building a model of human psychology then I'm concerned whether or not it in fact acts on that model, which means that I really want to define "knowledge" in a way that does not depend on whether a certain agent acts upon it.
For the same reason I think it would be natural to say that the sailing ship had knowledge, and that knowledge was lost when it sank. But if we define knowledge in terms of the actions that follow then the sailing ship never had knowledge in the first place.
Now you might say that it was possible that the sailing ship would have survived and acted upon its knowledge of the coastline, but imagine a sailing ship that, unbeknownst to it, is sailing into a storm in which it will certainly be destroyed, and along the way is building an accurate map of the coastline. I would say that the sailing ship is accumulating knowledge and that the knowledge is lost when the sailing ship sinks. But the attempted definition from this post would say that the sailing ship is not accumulating knowledge at all, which seems strange.
It's of course important to ground out these investigations in practical goals or else we end up in an endless maze of philosophical examples and counter-examples, but I do think this particular concern grounds out in the practical goal of overcoming deception in policies derived from machine learning.
Is this sequence complete? I was expecting a final literature review post before summarizing for the newsletter, but it's been a while since the last update and you've posted something new, so maybe you decided to skip it?
The sequence is now complete.
It's actually written, just need to edit and post. Should be very soon. Thanks for checking on it.
Building off Chris' suggestion about Kolmogorov complexity, what if we consider the Kolmogorov complexity of the thing we want knowledge about (e.g. the location of an object) given the 'knowledge containing' thing (e.g. a piece of paper with the location coordinates written on it) as input?
Wikipedia tells me this is called the 'conditional Kolmogorov complexity' of x (the thing we want knowledge about) given r (the state of the region potentially containing knowledge), K(x|r).
(Chris I'm not sure if I understood all of your comment, so maybe this is what you were already gesturing at.)
It seems like the problem you (Alex) see with mutual information as a metric for knowledge is that it doesn't take into account how "useful and accessible" that information is. I am guessing that what you mean by 'useful' is 'able to be used' (i.e. if the information was 'not useful' to an agent simply because the agent didn't care about it, I'm guessing we wouldn't therefore want to say the knowledge isn't there), so I'm going to take the liberty of saying "usable" here to capture the "useful and accessible" notion (but please correct me if I'm misunderstanding you).
I can see two ways that information can be less easily "usable" for a given agent. 1. Physical constraint, e.g. a map is locked in a safe so it's hard for the agent to get to it, or the map is very far away. 2. Complexity, e.g. rather than a map we have a whole bunch of readings from sensors on a go-kart that has driven around the area whose layout we want to know. This is less easily "usable" than a map, because we need a longer algorithm to extract the answers we want from it (e.g. "what road will this left turn take me to?"). [EDIT: Though maybe I'm equivocating between Kolmogorov complexity and runtime complexity here?] This second way of being less easily usable is what duck_master articulates in their comment (I think!).
It makes sense to me to not use sense 1 (physical constraint) in our definition of knowledge, because it seems like we want to say a map contains knowledge regardless of whether it is, in your example for another post, at the bottom of the ocean or not.
So then we're left with sense 2, for which we can use the conditional Kolmogorov complexity to make a metric.
To be more specific, perhaps we could say that for a variable X (e.g. the location of an object), and the state r of some physical region (e.g. a map), the knowledge which r contains about X is
K(x) - K(x|r)
where x is the value of the variable X.
This seems like the kind of thing that would already have a name, so I just did some Googling and yes it looks like this is "Absolute mutual information", notated IK(x,r).
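Absolute mutual information is usually written as K(x) - K(x|r). Kolmogorov complexity is uncomputable, so the sketch below swaps in a general-purpose compressor (zlib) as a crude stand-in, in the spirit of compression-based distance measures. That substitution, the helper names, and the toy data are all assumptions made for illustration. The numbers it produces are only heuristic proxies and say nothing about how hard the information is to extract.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for K."""
    return len(zlib.compress(data, 9))


def conditional_c(x: bytes, r: bytes) -> int:
    """Heuristic proxy for K(x|r): extra bytes needed to encode x after r."""
    return c(r + x) - c(r)


def knowledge_proxy(x: bytes, r: bytes) -> int:
    """Heuristic proxy for K(x) - K(x|r)."""
    return c(x) - conditional_c(x, r)


def random_bytes(seed: int, n: int) -> bytes:
    """Reproducible pseudo-random bytes, used here as incompressible toy data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


x = random_bytes(0, 2000)              # the fact we want knowledge about
r_map = b"header:" + x + b":footer"    # a region that literally records x
r_rock = random_bytes(1, 2000)         # a region unrelated to x

related = knowledge_proxy(x, r_map)
unrelated = knowledge_proxy(x, r_rock)
```

With this toy data, `related` comes out far larger than `unrelated`, matching the intuition that the map-like region holds most of the information needed to reproduce x.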
Choosing this way to define knowledge means we include cases where the knowledge is encoded by chance-- e.g. If someone draws a dot on a map at random and the dot coincidentally matches the position of an object, this metric would say that the map now does contain knowledge about the position of an object. I think this is a good thing-- It means that we can e.g. look at a rock that came in from outer space with an inscription on it and say whether it contains knowledge, without having to know about the causal process that produced those inscriptions. But if we wanted to only include cases where there's a reliable correlation and not just chance, we could modify the metric (perhaps just modify it to the expected absolute mutual information E(IK(X,R))).
P.S. I commented on another post in this sequence with a different idea last night, but I like this idea better :)
Hm on reflection I actually don't think this does what I thought it did. Specifically I don't think it captures the amount of 'complexity barrier' reducing the usability of the information. I think I was indeed equivocating between computational (space and time) complexity, vs. Kolmogorov complexity. My suggestion captures the latter, not the former.
Also, some further Googling has told me that the expected absolute mutual information, my other suggestion at the end, is "close" to Shannon mutual information (https://arxiv.org/abs/cs/0410002) so doesn't seem like that's actually significantly different to the mutual information option which you already discussed.
Au contraire, I think that "mutual information between the object and the environment" is basically the right definition of "knowledge", at least for knowledge about the world (as it correctly predicts that all four attempted "counterexamples" are in fact forms of knowledge), but that the knowledge of an object also depends on the level of abstraction of the object which you're considering.
For example, for your rock example: A rock, as a quantum object, is continually acquiring mutual information with the affairs of humans by the imprinting of subatomic information onto the surface of rock by photons bouncing off the Earth. This means that, if I was to examine the rock-as-a-quantum-object for a really long time, I would know the affairs of humans (due to the subatomic imprinting of this information on the surface of the rock), and not only that, but also the complete workings of quantum gravity, the exact formation of the rock, the exact proportions of each chemical that went into producing the rock, the crystal structure of the rock, and the exact sequence of (micro-)chips/scratches that went into making this rock into its current shape. I feel perfectly fine counting all this as the knowledge of the rock-as-a-quantum-object, because this information about the world is stored in the rock.
(Whereas, if I were only allowed to examine the rock-as-a-macroscopic-object, I would still know roughly what chemicals it was made of and how they came to be, and the largest fractures of the rock, but I wouldn't know about the affairs of humans; hence, such is the knowledge held by the rock-as-a-macroscopic-object. This makes sense because the rock-as-a-macroscopic-object is an abstraction of the rock-as-a-quantum-object, and abstractions always throw away information except that which is "useful at a distance".)
For more abstract kinds of knowledge, my intuition defaults to question-answering/epistemic-probability/bet-type definitions, at least for sufficiently agent-y things. For example, I know that 1+1=2. If you were to ask me, "What is 1+1?", I would respond "2". If you were to ask me to bet on what 1+1 was, in such a way that the bet would be instantly decided by Omega, the omniscient alien, I would bet with very high probability (maybe 40:1 odds in favor, if I had to come up with concrete numbers?) that it would be 2 (not 1, because of Cromwell's law, and also because maybe my brain's mental arithmetic functions are having a bad day). However, I do not know whether the Riemann Hypothesis is true, false, or independent of ZFC. If you asked me, "Is the Riemann Hypothesis true, false, or independent of ZFC?", I would answer, "I don't know" instead of choosing one of the three possibilities, because I don't know. If you asked me to bet on whether the Riemann Hypothesis was true, false, or independent of ZFC, with the bet to be instantly decided by Omega, I might bet 70% true, 20% false, and 10% independent (totally made-up semi-plausible figures that have no bearing on the heart of the argument; I haven't really tested my probabilistic calibration), but I wouldn't put >95% implied probability on anything because I'm not that confident in any one possibility. Thusly, for abstract kinds of knowledge, I think I would say that an agent (or a sufficiently agent-y thing) knows an abstract fact X if it tells you about this fact when prompted with a suitably phrased question, and/or if it places/would place a bet in favor of fact X with very high implied probability if prompted to bet about it.
(One problem with this definition is that, intuitively, when I woke up today, I had no idea what 384384*20201 was; the integers here are also completely arbitrary. However, after I typed it into a calculator and got 7764941184, I now know that 384384*20201 = 7764941184. I think this is also known as the problem of logical omniscience; Scott Aaronson once wrote a pretty nice essay about this topic and others from the perspective of computational complexity.)
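The betting criterion described above can be sketched in a few lines. The function names and the dictionary representation of credences are hypothetical. The 95% threshold and the example figures are the ones quoted in the comment, and the sketch inherits the logical-omniscience problem just mentioned.

```python
from typing import Dict


def implied_probability(odds_for: float, odds_against: float = 1.0) -> float:
    """Probability implied by betting odds of odds_for : odds_against in favor."""
    return odds_for / (odds_for + odds_against)


def knows(credences: Dict[str, float], fact: str, threshold: float = 0.95) -> bool:
    """Betting-style test: does the agent put more than `threshold` on `fact`?"""
    return credences.get(fact, 0.0) > threshold


# 40:1 odds in favor of "1+1=2"
arithmetic = {"1+1=2": implied_probability(40)}
# The made-up figures for the Riemann Hypothesis
riemann = {"true": 0.70, "false": 0.20, "independent": 0.10}

knows_arithmetic = knows(arithmetic, "1+1=2")
knows_riemann = any(knows(riemann, outcome) for outcome in riemann)
```

Odds of 40:1 imply a probability of 40/41, which clears the 95% bar, while none of the three Riemann outcomes does.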
I have basically no intuition whatsoever on what it means for a rock* to know that the Riemann Hypothesis is true, false, or independent of ZFC. My extremely stupid and unprincipled guess is that, unless a rock is physically inscribed with a proof of the true answer, it doesn't know, and that otherwise it does.
*I'm using a rock here as a generic example of a clearly-non-agentic thing. Obviously, if a rock was an agent, it'd be a very special rock, at least in the part of the multiverse that I inhabit. Feel free to replace "rock" with other words for non-agents.
Thank you for this comment duck_master.
I take your point that it is possible to extract knowledge about human affairs, and about many other things, from the quantum structure of a rock that has been orbiting the Earth. However, I am interested in a definition of knowledge that allows me to say what a given AI does or does not know, insofar as it has the capacity to act on this knowledge. For example, I would like to know whether my robot vacuum has acquired sophisticated knowledge of human psychology, since if it has, and I wasn't expecting it to, then I might choose to switch it off. On the other hand, if I merely discover that my AI has recorded some videos of humans then I am less concerned, even if these videos contain the basic data necessary to construct sophisticated knowledge of human psychology, as in the case with the rock. Therefore I am interested not just in information, but something like action-readiness. I am referring to that which is both informative and action-ready as "knowledge", although this may be stretching the standard use of this term.
Now you say that we might measure more abstract kinds of knowledge by looking at what an AI is willing to bet on. I agree that this is a good way to measure knowledge if it is available. However, if we are worried that an AI is deceiving us, then we may not be willing to trust its reports of its own epistemic state, or even of the bets it makes, since it may be willing to lose money now in order to convince us that it is not particularly intelligent, in order to make a treacherous turn later. Therefore I would very much like to find a definition that does not require me to interact with the AI through its input/output channels in order to find out what it knows, but rather allows me to look directly at its internals. I realize this may be impossible, but this is my goal.
So as you can see, my attempt at a definition of knowledge is very much wrapped up with the specific problem I'm trying to solve, and so any answers I arrive at may not be useful beyond this specific AI-related question. Nevertheless, I see this as an important question and so am content to be a little myopic in my investigation.
Thanks for the reply. I take it that not only are you interested in the idea of knowledge, but that you are particularly interested in the idea of actionable knowledge.
Upon further reflection, I realize that all of the examples and partial definitions I gave in my earlier comment can in fact be summarized in a single, simple definition: a thing X has knowledge of a fact Y iff it contains some (sufficiently simple) representation of Y. (For example, a rock knows about the affairs of humans because it has a representation of those affairs in the form of Fisher information, which is enough simplicity for facts-about-the-world.) Using this definition, it becomes much easier to define actionable knowledge: a thing X has actionable knowledge of a fact Y iff it contains some sufficiently simple representation of Y, and this representation is so simple that an agent with access to this information could (with sufficiently minimal difficulty) make actions that are based on fact Y. (For example, I have actionable knowledge that 1 + 1 = 2, because my internal representation of this fact is so simple that I can literally type up its statement in a comment.) It also becomes clearer that actionable knowledge and knowledge are not the same (since, for example, the knowledge about the world held by a computer that records cryptographic hashes of everything it observes could not be acted upon without breaking the hashes, which is presumably infeasible).
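The hash-recording computer in the last example can be illustrated with a small sketch. The function names and the sample observation are made up for illustration. SHA-256 from Python's standard `hashlib` is used only as a familiar example of a cryptographic hash.

```python
import hashlib


def record(observation: bytes) -> str:
    """What the hypothetical recorder stores: only a SHA-256 digest."""
    return hashlib.sha256(observation).hexdigest()


def confirms(stored_digest: str, guess: bytes) -> bool:
    """The digest can check a candidate observation but cannot produce one."""
    return record(guess) == stored_digest


# Hypothetical observation
stored = record(b"the key is under the mat")

right_guess = confirms(stored, b"the key is under the mat")
wrong_guess = confirms(stored, b"the key is under the rug")
```

The stored digest is fully determined by the observation, yet the only thing an agent can do with it is test guesses it already has, which is one way to picture non-actionable information.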
So as for the human psychology/robot vacuum example: If your robot vacuum's internal representation of human psychology is complex (such as in the form of video recordings of humans only), then it's not actionable knowledge and your robot vacuum can't act on it; if it's sufficiently simple, such as a low-complexity-yet-high-fidelity executable simulation of a human psyche, your robot vacuum can. My intuition also suggests in this case that your robot vacuum's knowledge of human psychology is actionable iff it has a succinct representation of the natural abstraction of "human psychology" (I think this might be generalizable; i.e. knowledge is actionable iff it's succinct when described in terms of natural abstractions), and that finding out whether your robot vacuum's knowledge is sufficiently simple is essentially a matter of interpretability. As for the betting thing, the simple unified definition that I gave in the last paragraph should apply as well.
I very much agree with the emphasis on actionability. But what is it about a physical artifact that makes the knowledge it contains actionable? I don't think it can be simplicity alone. Suppose I record the trajectory of the moon over many nights by carving markings into a piece of wood. This is a very simple representation, but it does not contain actionable knowledge in the same way that a textbook on Newtonian mechanics does, even if the textbook were represented in a less simple way (say, as a PDF on a computer).
Then how can it ever be absent?
I think knowledge as a whole cannot be absent, but knowledge of a particular fact can definitely be absent (if there's no relationship between the thing-of-discourse and the fact).
So rocks have nonzero knowledge?