I don't think I understand, concretely, what a non-mechanistic model looks like in your view. Can you give a concrete example of a useful non-mechanistic model?
Let's say your causal model looks something like this:
What causes you to specifically call out "sunblessings" as the "correct" upstream node in the world model of why you take your friend to dinner, as opposed to "fossil fuels" or "the big bang" or "human civilization existing" or "the restaurant having tasty food"?
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops).
Like, rationalists intellectually know that thermodynamics is a thing, but it doesn't seem common for rationalists to think of everything important as being the result of emanations from the sun.
I expect that if you took a room with 100 rationalists, told them to consider something that is important to them, asked them how that thing came to be, and then had them repeat the process 25 more times, at least half of the rationalists in the room would include some variation of "because the sun shines" within their causal chains. At the same time, I don't think rationalists tend to say things like "for dinner, I think I will make a tofu stir fry, and ultimately I'm able to make this decision because there's a ball of fusing hydrogen about 150 million kilometers away".
Put another way, I expect that large language models encode many salient learned aspects of their environments, and that those attributes are largely detectable in specific places in activation space. I do not expect that large language models encode all of the implications of those learned aspects of their environments anywhere, and I don't particularly expect it to be possible to mechanistically determine all of those implications without actually running the language model. But I don't think "don't hold the whole of their world model, including all implications thereof, in mind at all times" is something particular to LLMs.
Large language models absolutely do not have a representation for thing 2, because the whole of kings had shattered into many different shards before they were trained, and they've only been fed scattered bits and pieces of it.
Do humans have a representation for thing 2?
Now my words may start to sound blatant, and I may look like an overconfident noob, but... this phrase is most likely an outright lie. GPT-4 and Gemini aren't even two-task neural networks. Neither can take a picture and edit it. Instead, an image-recognition network hands a text description to the blind text-only network, which works only with text and lacks a basic understanding of space. That network then writes a prompt for an image-generator network, which can't see the original image.
Ignoring for the moment the "text-to-image and image-to-text models use a shared latent space to translate between the two domains, and so are, to a significant extent, operating on the same conceptual space" quibble...
GPT-4 and Gemini can both use tools, and can also build tools. Humans without access to tools aren't particularly scary on a global scale. Humans with tools can be terrifying.
"Agency": yet no one has shown that it is anything more than a (quite possibly wrong) hypothesis. Humans don't work like that: no one has a primary, unchangeable goal that they didn't acquire through learning, or that they wouldn't override for a sufficient reason. Nothing seems to explain why a general AI would, even if (a highly likely "if") it doesn't work like a human mind.
There are indeed a number of posts and comments and debates on this very site making approximately that point, yeah.
It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit.
It sure would.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves
Indeed.
these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
This follows for weight-space, but I think it doesn't follow for activation space. We expect that the ecological role of king is driven by some specific pressures that apply in certain specific circumstances (e.g. in the times that farmers would come in contact with bandits), while not being very applicable at most other times (e.g. when the tide is coming in). As such, to understand the role of the king, it is useful to be able to distinguish times when the environmental pressure strongly applies from the times when it does not strongly apply. Other inferences may be downstream of this ability to distinguish, and there will be some pressure for these downstream inferences to all refer to the same upstream feature, rather than having a bunch of redundant and incomplete copies. So I argue that there is in fact a reason for these imprints to be concentrated into a specific spot of activation space.
Recent work on SAEs as applied to transformer residuals seems to back this intuition up.
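For concreteness, here is a minimal sketch of what an SAE over a residual stream looks like. All dimensions and weights are made-up toy values, not taken from any real model; a trained SAE would also add an L1 penalty on the feature activations so that each input lights up only a few features.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # residual-stream width (toy value)
d_feats = 64   # overcomplete dictionary of candidate "features"

W_enc = rng.normal(scale=0.1, size=(d_model, d_feats))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(scale=0.1, size=(d_feats, d_model))

def sae(x):
    """Encode a residual-stream vector into nonnegative feature
    activations, then reconstruct it from those activations. During
    training, ReLU plus an L1 sparsity penalty push most activations
    to exactly zero; with these untrained toy weights, only the ReLU
    contributes to sparsity."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    x_hat = f @ W_dec                       # reconstruction of x
    return f, x_hat

x = rng.normal(size=d_model)  # stand-in for one residual-stream vector
f, x_hat = sae(x)
```

The point of the argument above is that, after training, individual coordinates of `f` tend to correspond to human-interpretable features: exactly the "imprints concentrated in a specific spot of activation space" picture.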
Also potentially relevant: "The Quantization Model of Neural Scaling" (Michaud et al., 2024)
We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are “quantized” into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
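The scaling argument in that abstract is easy to check numerically. Below is a toy illustration (all numbers are illustrative, not from the paper): if discrete quanta of skill are used with power-law frequencies proportional to k^-alpha and learned in order of decreasing frequency, then the residual loss after learning the first n quanta falls off as roughly n^(1 - alpha).

```python
import numpy as np

alpha = 1.5
K = 1_000_000  # total number of quanta (toy value)

# Use frequencies p_k proportional to k^-alpha, normalized to sum to 1.
freqs = np.arange(1, K + 1, dtype=float) ** -alpha
freqs /= freqs.sum()

def residual_loss(n):
    """Total use frequency of the quanta not yet learned, i.e. the
    loss remaining after the n most frequently used quanta are learned."""
    return freqs[n:].sum()

# Fit the log-log slope of loss vs. number of learned quanta.
ns = np.array([10**2, 10**3, 10**4])
losses = np.array([residual_loss(n) for n in ns])
slope = np.polyfit(np.log(ns), np.log(losses), 1)[0]
# The tail sum of k^-alpha scales like n^(1 - alpha), so the fitted
# slope should come out near 1 - alpha = -0.5.
```

This is just the "power law in use frequencies explains power law scaling of loss" step of their argument, reproduced in miniature.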
"Create bacteria that can quickly decompose any metal in any environment, including alloys and including metal that has been painted, and which also are competitive in those environments, and will retain all of those properties under all of the diverse selection pressures they will be under worldwide" is a much harder problem than "create bacteria that can decompose one specific type of metal in one specific environment", which in turn is harder than "identify specific methanobacteria which can corrode exposed steel by a small fraction of a millimeter per year, and find ways to improve that to a large fraction of a millimeter per year."
Also it seems the mechanism is "cause industrial society to collapse without killing literally all humans" -- I think "drop a sufficiently large but not too large rock on the earth" would also work to achieve that goal, you don't have to do anything galaxy-brained.
I am struggling to see how we do lose 80%+ of these jobs within the next 3 years.
Operationalizing this, I would give you 4:1 that the fraction (or raw number, if you'd prefer) of employees occupied as travel agents will be over 20% of today's value, according to the Labor Force Statistics from the US Bureau of Labor Statistics Current Population Survey (BLS CPS) Characteristics of the Employed dataset.
For reference, here are the historical values for the BLS CPS series cpsaat11b ("Employed persons by detailed occupation and age") since 2011 (which is the earliest year they have it available as a spreadsheet). If you want to play with the data yourself, I put it all in one place in google sheets here.
As of the 2023 survey, about 0.048% of surveyed employees, and 0.029% of surveyed people, were travel agents. As such, I would be willing to bet at 4:1 that when the 2027 data becomes available, at least 0.0096% of surveyed employees and at least 0.0058% of surveyed Americans report their occupation as "Travel Agent".
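Spelling out the threshold arithmetic (the 20% floor applied to the 2023 shares quoted above):

```python
# 2023 BLS CPS shares quoted above, in percent.
employees_share_2023 = 0.048  # % of surveyed employees who are travel agents
people_share_2023 = 0.029     # % of all surveyed people

# The bet pays out for me if the 2027 share stays above 20% of today's value.
threshold_employees = employees_share_2023 * 0.20  # 0.0096%
threshold_people = people_share_2023 * 0.20        # 0.0058%
```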
Are you interested in taking the opposite side of this bet?
Edit: Fixed arithmetic error in the percentages in the offered bet.
Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.
I think this is easy to imagine. I'm an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don't need to create a clone government and hire 10 expert clones in order to do this.
The other 9 experts can also make guesses about which signals the government will use and what the relative weaknesses of their fellow experts are, and they can also act on those guesses. So in order to reason about what the outcome of the search will be, you have to reason both about yourself and about the other 9 experts, unless you somehow know that you are much better than they are at steering the outcome of the search as a whole. But in that case, only you can steer the search; the other 9 experts would fail if they tried to use the same strategy you're using.
Sure, that's also a useful thing to do sometimes. Is your contention that simple concentrated representations of resources and how they flow do not exist in the activations of LLMs that are reasoning about resources and how they flow?
If not, I think I still don't understand what sort of thing you think LLMs don't have a concentrated representation of.