This is so cool! Thanks so much, I plan to go through it in full when I have some time. For now, I was wondering if the red-circled matrix multiplication should actually be reversed, and the vector should be a column (i.e. matrix*column instead of row*matrix). I know the end result is equivalent, but it seems like, to be consistent, it should be switched: in every other example of a vector with a leg sticking out leftward, it's a column vector? Maybe this really doesn't matter, since I can just turn the page upside down and then b would be on the left with a leg sticking out to the right... but the fact that A dot b and b.T dot A.T give the same numbers (one read as a column, the other as a row) is itself an interesting fact.
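A quick numpy sanity check of that fact, with A and b as arbitrary random placeholders:

```python
# Check that A @ b (matrix * column) and b.T @ A.T (row * matrix) give the
# same entries, one as a column and one as a row.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # arbitrary 3x4 matrix
b = rng.standard_normal((4, 1))   # column vector

col_result = A @ b                # matrix * column  -> shape (3, 1)
row_result = b.T @ A.T            # row * matrix     -> shape (1, 3)

# Same numbers, just transposed ("turning the page upside down"):
assert np.allclose(col_result, row_result.T)
```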
Just to add to Carl Feynman's response, which I thought was good.
Part of the reason these systems are inefficient is that they require you to (effectively) run gradient descent even at inference time, even after training is over. Or you can run the RNN, which is mathematically equivalent, but again you can see where the inefficiency comes in: the value at time t=3 is a function of the value at t=2, which is a function of t=1, and so on, so to get the converged value of the activations you have to compute each timestep one by one, in a for loop.
This is in contrast to a feedforward network like a (normal) convnet or transformer, which can run extremely quickly and in parallel on a GPU.
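Here's a rough toy sketch of that difference (not the exact model from the post; the weights, dimensions, and learning rate are all made up), just to show where the sequential for-loop comes from versus a single feedforward pass:

```python
# Toy contrast: iterative inference (sequential, step t depends on step t-1)
# vs. a feedforward pass (one matrix multiply, parallelizable on a GPU).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)) * 0.1   # toy generative weights (made up)
x = rng.standard_normal(16)              # the input, e.g. an image patch

def iterative_inference(x, W, n_steps=200, lr=0.1):
    """Settle the latent activations r by gradient descent on the prediction error."""
    r = np.zeros(W.shape[1])
    for _ in range(n_steps):              # the unavoidable sequential loop
        error = x - W @ r                 # prediction error at this step
        r = r + lr * (W.T @ error)        # gradient step on 0.5 * ||x - W @ r||^2
    return r

def feedforward(x, W_ff):
    """A feedforward net gets its answer in one pass."""
    return W_ff @ x

r_slow = iterative_inference(x, W)            # many steps, each depending on the last
r_fast = feedforward(x, np.linalg.pinv(W))    # one matrix multiply (toy stand-in for
                                              # an amortized / feedforward solution)
```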
Thanks!
I think your thinking makes sense, and if, for instance, on every timestep you presented a different image in a stereotyped sequence, or with a certain correlation structure, you would indeed get information about those correlations into the weights. However, this model was designed to be used in the restricted setting where you show a single still image for many timesteps until convergence. In that setting, the weights give you image features for static images (in a hierarchical manner), and priors for low-level features feed back from activations in higher-level areas.
There are extensions to this model that deal with video, where explicit spatiotemporal expectations are built into the network. You can see one of those networks in this paper: https://arxiv.org/abs/2112.10048
But I've never implemented such a network myself.
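For the static-image setting I described above, though, here's a hedged two-level toy sketch in the spirit of that kind of predictive coding model (not the exact network from the post; all sizes and constants are arbitrary): a single fixed image is presented while the activations settle, and the higher level feeds back a prior on the lower one.

```python
# Two-level toy: hold one still image x fixed, let activations r1, r2 settle.
# The higher level's prediction W2 @ r2 acts as a top-down prior on r1.
import numpy as np

rng = np.random.default_rng(1)
x  = rng.standard_normal(32)               # one still image (toy stand-in)
W1 = rng.standard_normal((32, 16)) * 0.1   # level-1 generative weights
W2 = rng.standard_normal((16, 8))  * 0.1   # level-2 generative weights

r1 = np.zeros(16)
r2 = np.zeros(8)
lr = 0.05

for _ in range(200):                       # present the SAME image every step
    e0 = x  - W1 @ r1                      # error between image and level-1 prediction
    e1 = r1 - W2 @ r2                      # error between r1 and the top-down prior
    r1 += lr * (W1.T @ e0 - e1)            # bottom-up drive minus top-down prior pull
    r2 += lr * (W2.T @ e1)                 # level 2 explains level-1 activity

# Weight learning (not shown) would then nudge W1 and W2 along the outer products
# of errors and activations, which is where the hierarchical image features end up.
```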
First, brains (and biological systems more generally) have many constraints that artificial networks do not. Brains exist in the context of a physically instantiated body, with heavy energy constraints. Further, they exist in specific niches, with particular evolutionary histories, which have enormous effects on structure and function.
Second, biological brains have different types of intelligence from AI systems, at least currently. A bird is able to land fluidly on a thin branch in windy conditions, while GPT-4 can help you code. In general, the intelligences that one thinks of in the context of AGI do not totally overlap with the varied, often physical and metabolic, intelligences of biology.
All that being said, who knows what future AI systems will look like.
Thanks so much for this comment (and sorry for taking ~1 year to respond!!). I really liked everything you said.
For 1 and 2, I agree with everything and don't have anything to add.
3. I agree that there is something about the input/output mapping that is meaningful, but it is not everything. It would be great to have a full theory of exactly what the difference is, and of what distinguishes structure that counts as interesting internal computation (not a great descriptor of what I mean, but I can't think of anything better right now) from mere input/output computation.
4. I also think a great goal would be generalizing and formalizing what an "observer" of a computation is. I have a few ideas, but they are pretty half-baked right now.
5. That is an interesting point. I think it's fair. I do want to be careful to make sure that any "disagreements" are substantial and not just semantic squabbling here. I like your distinction between representation work and computational work. The idea of using vs. performing a computation is also interesting. At the end of the day I am always left craving some formalism where you could really see the nature of these distinctions.
6. Sounds like a good idea!
7. Agreed on all counts.
8. I was trying to ask whether there is anything that tells us that the output node is semantically meaningful without reference to, e.g., the input images of cats, or even knowledge of the input data distribution. Interpretability work, both in artificial neural networks and more traditionally in neuroscience, always uses knowledge of input distributions, or even input identity, to correlate the activity of neurons to the input, and in that way assigns semantics to neural activity (e.g., recently, Othello board states, or in neuroscience, Jennifer Aniston neurons or orientation-tuned neurons). But when I'm sitting down with my eyes closed and just thinking, there's no homunculus there with access to input distributions on my retina that can correlate some activity pattern to "cat." So how can the neural states in my brain "represent" (or embody, or whatever word you want to use) the semantic information of "cat" without this process of correlating to some ground-truth data? Where does "cat" come from when there's no cat there in the activity?!
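To make the standard move I'm describing concrete, here's a toy sketch (everything here is made-up placeholder data, just to show the shape of the procedure):

```python
# The usual interpretability / tuning-curve move: call a unit a "cat neuron"
# because its activity correlates with a ground-truth label column.
# Data is random noise here, purely to show the shape of the procedure.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_units = 1000, 50
activations = rng.standard_normal((n_images, n_units))   # pretend recordings
is_cat = rng.integers(0, 2, size=n_images)                # ground-truth labels

correlations = [np.corrcoef(activations[:, i], is_cat)[0, 1] for i in range(n_units)]
cat_unit = int(np.argmax(np.abs(correlations)))
print(f"unit {cat_unit} gets called the 'cat neuron' (r = {correlations[cat_unit]:.2f})")

# The point of the question: this whole procedure presupposes the `is_cat` column.
# With eyes closed there is no such column available, yet the activity still
# somehow means "cat."
```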
9. SO WILD
Can you explain what you mean by second or third order dynamics? That sounds interesting. Do you mean e.g. the order of the differential equation or something else?
This is not obvious to me. It seems somewhat likely that the multimodality actually induces more explicit representations and uses of human-level abstract concepts; e.g., a Jennifer Aniston neuron in a human brain is multimodal.
This is the standard understanding in neuroscience (and, for what it's worth, is my working belief), but there is some evidence that throws a wrench into this idea and needs to be explained, for instance the review "Consciousness without a cerebral cortex: a challenge for neuroscience and medicine," which presents evidence that consciousness can occur without a cortex. In particular, there is a famous case of a human with hardly any cortex who seemed to act normally in most regards.
I think the issue is that what people often mean by "computing matrix multiplication" is something like what you've described here, but when they talk about "recognizing dogs" (at least sometimes; as you've so elegantly talked about in other posts, vibes and context really matter!), they are referring not only to the input/output transformation of the task (or even the physical transformation of world states) but also to the process by which the dog is recognized, which involves lots of internal human abstractions moving about in a particular way in the brains of people, and which may or may not be recapitulated in an artificial classification system.
To some degree it's a semantic issue. I will grant you that there is a way of talking about "recognizing dogs" that reduces it to the input/output mapping, but there is another way in which this doesn't work. The reason it makes sense for human beings to have these two different notions of performing a task is because we really care about theory of mind, and social settings, and figuring out what other people are thinking (and not just the state of their muscles or whatever dictates their output).
Although, for precision's sake, maybe they should really have different words associated with them, though I'm not sure what the words should be exactly. Maybe something like "solving a task" vs. "understanding a task," though I don't really like that.
Actually, my thinking can go the other way too. I think there actually is a sense in which the computer is not doing matrix multiplication, and it's really only the system of computer+human that is able to do it, and the human is doing A LOT of work here. I recognize this is not the sense people usually mean when they talk about computers doing matrix multiplication, but again, I think there are two senses of performing a computation, even though people use the same words.
The blog post linked says it's from August. Is there something new I'm missing?