Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: my own thoughts I've thought up in my own time. They may be quite or very wrong! I am likely not the first person to come to these ideas. All of my main points here are just hypotheses which I've come to by the reasoning stated below. Most of it is informal mathematical arguments about likely phenomena and none is rigorous proof. I might investigate them if I had the time/money/programming skills. Lots of my hypotheses are really long and difficult-to-parse sentences.

What is knowledge?

I think this question is bad.

It's too great of a challenge. It asks us (implicitly) for a mathematically rigorous definition which fits all of our human feelings about a very loaded word. This is often a doomed endeavour from the start, as human intuitions don't neatly map onto logic. Also, humans might disagree on what things count as or do not count as knowledge. So let's attempt to right this wrong question:

Imagine a given system is described as "knowing" something. What is the process that leads to the accumulation of said knowledge likely to look like?

I think this is much better.

We limit ourselves to systems which can definitely be said to "know" something. This allows us to pick a starting point. This might be a human, GPT-3, or a neural network which can tell apart dogs and fish. In fact the last of these will be my go-to example from here on. We also don't need to perfectly specify the process which generates knowledge all at once, only comment on its likely properties.

Properties of "Learning"

Say we have a very general system, with parameters $\theta$, with $t$ representing time during learning. Let's say they're initialized as $\theta_0$ according to some random distribution. Now it interacts with the dataset, which we will represent with $X$, taken from some distribution over possible datasets. The learning process will update $\theta$, so we can represent the parameters after some amount of time as $\theta(\theta_0, X, t)$. This reminds us that the set of parameters depends on three things: the initial parameters, the dataset, and the amount of training.

Consider $\theta(\theta_0, X, 0)$. This is trivially equal to $\theta_0$, and so it depends only on the choice of $\theta_0$. The dataset has had no chance to affect the parameters in any way.

So what about as $t \to \infty$? We would expect that $\theta(\theta_0, X, \infty)$ depends mostly on the choice of $X$ and much less strongly on $\theta_0$. There will presumably be some dependency on initial conditions, especially for very complex models like a big neural network with many local minima. But mostly it's $X$ which influences $\theta$.
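As a minimal sketch of this picture (my own toy, not anything from a real training run): write $\theta(\theta_0, X, t)$ for the parameters after $t$ steps, and take the simplest learner I can think of, a single slope fitted by gradient descent. The names `train` and `make_dataset`, the learning rate, and the two synthetic datasets are all assumptions made for illustration. Because this toy is convex, the dependence on $\theta_0$ vanishes entirely at large $t$, which is stronger than what we'd expect from a big neural network.

```python
import random

def train(theta0, dataset, steps, lr=0.1):
    """Return theta(theta0, X, t): the slope of y = theta * x after `steps` gradient steps."""
    theta = theta0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to theta.
        grad = sum(2 * (theta * x - y) * x for x, y in dataset) / len(dataset)
        theta -= lr * grad
    return theta

def make_dataset(true_slope, n=50, seed=0):
    """Hypothetical dataset: noisy points on a line through the origin."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_slope * x + rng.gauss(0, 0.05)) for x in xs]

X_a = make_dataset(true_slope=2.0, seed=1)
X_b = make_dataset(true_slope=-1.0, seed=2)
inits = [-5.0, 0.0, 5.0]  # three different choices of theta_0

for t in (0, 5, 500):
    finals_a = [round(train(th0, X_a, t), 3) for th0 in inits]
    finals_b = [round(train(th0, X_b, t), 3) for th0 in inits]
    print(f"t={t}: trained on X_a -> {finals_a}, trained on X_b -> {finals_b}")
```

At `t=0` each row simply repeats the initial values; by `t=500` the rows differ between datasets but no longer between initializations.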

So far this is just writing out basic sequences stuff. To make a map of the city you have to look at it, and to learn your model has to causally entangle itself with the dataset. But let's think about what happens when $X$ is slightly different.

Changes in the world

So far we've represented the whole dataset with a single letter $X$, as if it were just a number or something. But in reality it will have many, many independent parts. Most datasets which are used as inputs to learning processes are also highly structured.

Consider the dog-fish discriminator, trained on the dataset $X$. The system $\theta(\theta_0, X, \infty)$ could be said to have "knowledge" that "dogs have two eyes". One thing this means is that if we instead fed it an $X'$ which was identical except every dog had three eyes (TED), then the final values of $\theta$ would be different. The same is true of facts like "fish have scales" and "dogs have one tail". We could express this as follows:

$$X' = X + \Delta X_{TED}$$

Where $\Delta X_{TED}$ is the modification of "photoshopping the dogs to have three eyes". We now have:

$$\theta(\theta_0, X + \Delta X_{TED}, \infty) = \theta(\theta_0, X, \infty) + \Delta\theta_{TED}$$

Now let's consider how $\Delta\theta$ behaves. For lots of choices of $\Delta X$ it might just be a series of random changes tuning the whole set of $\theta$ values. But from my knowledge of neural networks, it might not be. Lots of image recognizing networks have been found to contain neurons with specific functions which relate to structures in the data, from simple line detectors, all the way up to "cityscape" detectors.

For this reason I suggest the following hypothesis:

Structured and localized changes in the dataset that a parameterized learning system is exposed to will cause localized changes in the final values of the parameters.
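Here is a minimal sketch of what testing this would look like on a toy, writing $\theta$ for the parameters and $X$, $X'$ for the original and modified datasets. Everything in it is an assumption for illustration: a four-parameter linear model stands in for the network, and the "structured change" is a change in how the targets depend on one feature. With independent features the localization is close to true by construction, so this shows the shape of the experiment rather than evidence for the hypothesis in real networks.

```python
import random

def make_dataset(weights, n=200, seed=0):
    """Hypothetical dataset: independent features, targets from a linear rule."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        feats = [rng.uniform(-1, 1) for _ in weights]
        target = sum(w * f for w, f in zip(weights, feats))
        data.append((feats, target))
    return data

def train(theta0, dataset, steps=400, lr=0.1):
    """Full-batch gradient descent on mean squared error for a linear model."""
    theta = list(theta0)
    n = len(dataset)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for feats, target in dataset:
            err = sum(t * f for t, f in zip(theta, feats)) - target
            for j, f in enumerate(feats):
                grad[j] += 2 * err * f / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# X and X' share a seed, so their inputs are identical; only the rule tied to
# feature 2 differs (the toy analogue of "every dog now has three eyes").
X = make_dataset([1.0, -2.0, 0.5, 3.0], seed=0)
X_prime = make_dataset([1.0, -2.0, 1.5, 3.0], seed=0)

init_rng = random.Random(42)
theta0 = [init_rng.gauss(0, 1) for _ in range(4)]  # same theta_0 for both runs

theta_final = train(theta0, X)
theta_final_prime = train(theta0, X_prime)
delta_theta = [b - a for a, b in zip(theta_final, theta_final_prime)]
print([round(d, 4) for d in delta_theta])
```

The printed vector is the toy version of $\Delta\theta$: the difference in final parameters caused by the dataset modification, with the initialization held fixed.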

Impracticalities and Solutions

Now it would be lovely to train all of GPT-3 twice, once with the original dataset, and once in a world where dogs are blue. Then we could see the exact parameters that lead it to return sentences like "the dog had [chocolate rather than azure] fur". Unfortunately rewriting the whole training dataset around this is just not going to happen.

Finding the flow of information and influence in a system is easy if you have a large distribution of different inputs and outputs (and a good idea of the direction of causality). If you have just a single example, you can't use any statistical tools at all.

So what else can we do? Well, we don't just have access to $\theta(\theta_0, X, \infty)$. In principle we could look at the course of the entire training process and how $\theta$ changes over time. For each timestep, and each element of the dataset $x_i$, we could record how much each element of $\theta$ is changed. We'll come back to this.
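The bookkeeping described here can be sketched as follows, assuming per-example SGD on a toy linear model. The function `train_with_update_log`, the dataset builder, and the choice to accumulate absolute step sizes are my own assumptions; other measures of "how much a parameter was changed" would be equally defensible.

```python
import random

def make_dog_fish_dataset(n_each=20, seed=0):
    """Hypothetical toy data: 'dog' rows only use feature 0, 'fish' rows only feature 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_each):
        x = rng.uniform(0.2, 1.0)
        data.append(([x, 0.0], 2.0 * x))    # dog example (even index)
        x = rng.uniform(0.2, 1.0)
        data.append(([0.0, x], -1.0 * x))   # fish example (odd index)
    return data

def train_with_update_log(theta0, dataset, epochs=20, lr=0.05):
    """Per-example SGD that records, for every (example, parameter) pair,
    the total absolute size of the updates that example caused."""
    theta = list(theta0)
    update_log = [[0.0] * len(theta) for _ in dataset]
    for _ in range(epochs):
        for i, (feats, target) in enumerate(dataset):
            err = sum(t * f for t, f in zip(theta, feats)) - target
            for j, f in enumerate(feats):
                step = -lr * 2 * err * f
                theta[j] += step
                update_log[i][j] += abs(step)
    return theta, update_log

X = make_dog_fish_dataset()
theta, update_log = train_with_update_log([0.0, 0.0], X)

# Total update attributed to each class, per parameter.
dog_totals = [sum(update_log[i][j] for i in range(0, len(X), 2)) for j in range(2)]
fish_totals = [sum(update_log[i][j] for i in range(1, len(X), 2)) for j in range(2)]
print("dog examples moved parameters by:", dog_totals)
print("fish examples moved parameters by:", fish_totals)
```

For a real network the log would be far too large to store densely (examples times parameters), so some aggregation over examples or parameters would be needed; I haven't tried to address that here.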

Let's consider the dataset as a function of the external world: $X(W)$. All the language we've been using about knowledge has previously only applied to the dataset. Now we can describe how it applies to the world as a whole.

For some things the equivalence of knowledge of $X$ and $W$ is pretty obvious. If the dataset is being used for a self-driving car and it's just a bunch of pictures and videos then basically anything the resulting parameterised system knows about $X$ it also knows about $W$. But for obscure manufactured datasets like [4000 pictures of dogs photoshopped to have three eyes] then it's really not clear.

Either way, we can think about $W$ as having influence over $X$ the same way as we can think about $X$ as having influence over $\theta$. So we might be able to form hypotheses about this whole process. Let's go back to $W$. First off imagine a change $\Delta W$, such as "dogs have three eyes". This will change some elements of $X$ more than others. Certain angles of dog photos, breeds of dogs, will be changed more. Photos of fish will stay the same!

Now we can imagine a function $\Delta X(\Delta W)$. This represents some propagation of influence from $W \to X$. Note that the influence of $W$ on $X$ is independent of our training process or $\theta_0$. This makes sense because different bits of the training dataset contain information about different bits of the world. How different training methods extract this information might be less obvious.
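One way to write this chain of influence down, with $W$ the world, $X$ the dataset, $\theta_0$ the initial parameters and $\theta(\theta_0, X, \infty)$ the trained ones. This assumes both maps are smooth enough to linearize for a small change $\Delta W$, which is an extra assumption of mine and certainly false for some changes:

```latex
\begin{aligned}
\Delta X &\approx \frac{\partial X}{\partial W}\,\Delta W, \\
\Delta\theta &\approx \frac{\partial \theta(\theta_0, X, \infty)}{\partial X}\,\Delta X
  \approx \frac{\partial \theta(\theta_0, X, \infty)}{\partial X}\,
          \frac{\partial X}{\partial W}\,\Delta W .
\end{aligned}
```

The factor $\partial X / \partial W$ contains no $\theta_0$ and no reference to training, which is the independence noted above. Everything specific to the learning method sits in $\partial \theta / \partial X$.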

The Training Process

During training, $\theta$ is exposed to various elements of $X$ and updated. Different elements of $X$ will update $\theta$ by different amounts. Since the learning process is about transferring influence over $\theta$ from $\theta_0$ to $W$ (acting via $X$), we might expect that for a given element of $X$, it has more "influence" over the final values of the elements of $\theta$ which were changed the most due to exposure to that particular element of $X$ during training.

This leads us to a second hypothesis:

The degree to which an element of the dataset causes an element of the parameters to be updated during training is correlated with the degree to which a change to that dataset element would have caused a change in the final value of the parameter.

Which is equivalent to:

Knowledge of a specific property of the dataset is disproportionately concentrated in the elements of the final parameters that have been updated the most during training when "exposed" to certain dataset elements that have a lot of mutual information with that property.

For the dog-fish example: elements of parameter space which have updated disproportionately when exposed to photos of dogs that contain the dogs' heads (and therefore show just two eyes), will be more likely to contain "knowledge" of the fact that "dogs have two eyes".
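A sketch of how the second hypothesis could be checked on a toy, under heavy assumptions: a two-parameter linear model trained by per-example SGD, "dog" examples that only touch feature 0 and "fish" examples that only touch feature 1, and a counterfactual defined as nudging one example's target by a fixed amount and retraining from the same $\theta_0$. The helper names and the size of the nudge are hypothetical. In this separable toy the zeros of the two quantities line up by construction, so it illustrates the measurement rather than confirming anything about real networks.

```python
import random

def sgd(theta0, dataset, epochs=20, lr=0.05):
    """Per-example SGD on a linear model; also logs |update| per (example, parameter)."""
    theta = list(theta0)
    log = [[0.0] * len(theta) for _ in dataset]
    for _ in range(epochs):
        for i, (feats, target) in enumerate(dataset):
            err = sum(t * f for t, f in zip(theta, feats)) - target
            for j, f in enumerate(feats):
                step = -lr * 2 * err * f
                theta[j] += step
                log[i][j] += abs(step)
    return theta, log

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / (va * vb) ** 0.5

rng = random.Random(0)
X = []
for _ in range(10):
    x = rng.uniform(0.2, 1.0)
    X.append(([x, 0.0], 2.0 * x + rng.gauss(0, 0.05)))   # dog (even index)
    x = rng.uniform(0.2, 1.0)
    X.append(([0.0, x], -1.0 * x + rng.gauss(0, 0.05)))  # fish (odd index)

theta0 = [0.0, 0.0]
theta_final, update_log = sgd(theta0, X)

# Counterfactual influence: change one example, retrain, compare final parameters.
counterfactual = []
for i in range(len(X)):
    X_mod = list(X)
    feats, target = X[i]
    X_mod[i] = (feats, target + 0.5)  # hypothetical perturbation of example i
    theta_mod, _ = sgd(theta0, X_mod)
    counterfactual.append([abs(a - b) for a, b in zip(theta_mod, theta_final)])

flat_updates = [v for row in update_log for v in row]
flat_counterfactual = [v for row in counterfactual for v in row]
corr = pearson(flat_updates, flat_counterfactual)
print("correlation between update size and counterfactual change:", round(corr, 3))
```

Retraining once per dataset element is only feasible because the toy is tiny. The point of the hypothesis is that the update log, which comes for free from a single training run, might stand in for these counterfactuals.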

This naturally leads us to a final hypothesis:

Correlating update-size as a function of dataset-element across two models will allow us to identify subsets of parameters which contain the same knowledge across two very different models.


Access to a simple interpreted model of a system will allow us to rapidly infer information about a much larger model of the same system if they are trained on the same datasets, and we have access to both training histories.
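A sketch of the cross-model matching idea, in the easiest case I could construct. Two toy linear models are trained on the same examples with per-example SGD, but model B sees the features in swapped order and starts from a different initialization. Each parameter's column of the update log (one entry per dataset element) is then correlated across models, and each parameter of A is matched to its best-correlated parameter of B. The models, helper names and noise-free data are all assumptions; two genuinely different architectures would be a much harder test, and whether the method survives that is exactly what the hypothesis leaves open.

```python
import random

def sgd_update_log(theta0, dataset, epochs=20, lr=0.05):
    """Per-example SGD on a linear model; returns |update| per (example, parameter)."""
    theta = list(theta0)
    log = [[0.0] * len(theta) for _ in dataset]
    for _ in range(epochs):
        for i, (feats, target) in enumerate(dataset):
            err = sum(t * f for t, f in zip(theta, feats)) - target
            for j, f in enumerate(feats):
                step = -lr * 2 * err * f
                theta[j] += step
                log[i][j] += abs(step)
    return log

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / (va * vb) ** 0.5

def column(matrix, j):
    return [row[j] for row in matrix]

rng = random.Random(0)
xs = [rng.uniform(0.2, 1.0) for _ in range(40)]

# Same examples, two encodings: model A sees [dog, fish], model B sees [fish, dog].
data_a, data_b = [], []
for i, x in enumerate(xs):
    if i % 2 == 0:   # dog example
        data_a.append(([x, 0.0], 2.0 * x))
        data_b.append(([0.0, x], 2.0 * x))
    else:            # fish example
        data_a.append(([0.0, x], -1.0 * x))
        data_b.append(([x, 0.0], -1.0 * x))

log_a = sgd_update_log([0.0, 0.0], data_a)
log_b = sgd_update_log([1.0, -3.0], data_b)  # different initialization

# similarity[j][k]: correlation, across dataset elements, of A's parameter j
# update profile with B's parameter k update profile.
similarity = [[pearson(column(log_a, j), column(log_b, k)) for k in range(2)]
              for j in range(2)]
matches = [max(range(2), key=lambda k: similarity[j][k]) for j in range(2)]
print("A parameter -> best-matching B parameter:", matches)
```

If one of the two models were small and already interpreted, the matches would transfer its labels ("this parameter is about dogs") onto the other model's parameters.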


I think an AI which takes over the world will have a very accurate model of human morality, it just won't care about it. I think that one way of getting the AI to not kill us is to extract parts of the human utility-function-value-system-decision-making-process-thing from its model and tell the AI to do those. I think that to do this we need to understand more about where exactly the "knowledge" is in an inscrutable model. I also find thinking about this very interesting.

4 comments

Meta commentary: this post is a great example of how to do the very earliest stages of conceptual research. Well done.

Nice post. I agree that a crucial part of AGI alignment should involve routing an AI's knowledge of human values to its own internal motivational circuitry, such that as its knowledge of human needs/goals/drives/preferences grows, so too does its alignment to those things. One key to this part of the problem may be to build in structural and inductive biases that steer the AI toward less inscrutable models.

I would say that to "know" something necessitates being able to make accurate predictions related to that thing. For most learning systems, this would imply developing some sort of generative or predictive model of its training data. In your dog/fish example, this might be realized with something like a conditional GAN, maybe combined with an autoencoder, where "knowing" the class of a sample allows the model to predict features of the sample (e.g., "fish" class -> there will be fins about here and scales about here; "dog" class -> there will be two eyes on the face, furry texture on the body, etc.). Combining the class label with some sort of latent-space representation should enable it to closely reproduce the full image.

The "knowledge" here is contained less in the class labels and latent space representations and more in the parameters and structure of the generative model, which is where it actually learned the generative/causal structure of its training data. This kind of knowledge allows such models to do things like inpainting, denoising, super-resolution, and animation of an image, generating information that was not in its inputs but that it predicts "ought" to be there based on what it has learned before.

This idea is also related to the predictive coding theory of the brain, where perception happens by constantly trying to generate predictions of what the senses will receive and continuously updating based on prediction errors. Again, "knowledge" exists in the generative models and causal graphs that the brain uses to make these predictions.

Thanks! I get your arguments about "knowledge" being restricted to predictive domains, but I think it's (mostly) just a semantic issue. I also don't think the specifics of the word "knowledge" are particularly important to my points which is what I attempted to clarify at the start, but I've clearly typical-minded and assumed that of course everyone would agree with me about a dog/fish classifier having "knowledge", when it's more of an edge-case than I thought! Perhaps a better version of this post would have either tabooed "knowledge" altogether or picked a more obviously-knowledge-having model.

Well, it certainly has mutual information with the training data, even if it only acts as a classifier (actually, classifiers can be seen as inverse generative models, so there is some generative-ish information there, as well). From that perspective, your arguments certainly hold. Although, I'm not sure if "mutual information" is precisely what you're going for, either. Yes, I agree, I should have tabooed "knowledge" in how I read it.