"Directly or indirectly" is a bit vague. Maybe make a market on Manifold if one doesn't exist already.
Thanks! I've included Erik Hoel's and lc's essays.
Your article doesn't actually call for AI slowdown/pause/restraint, as far as I can tell, and explicitly guards against that interpretation —
This analysis does not show that restraint for AGI is currently desirable; that it would be easy; that it would be a wise strategy (given its consequences); or that it is an optimal or competitive approach relative to other available AI governance strategies.
But if you've written anything which explicitly endorses AI restraint then I'll include that in the list.
Finite context window.
Countries that were on the frontier of the Industrial Revolution underwent massive economic, social, and political shocks, and it would've been better if the change had been smoothed over about double the duration.
Countries that industrialised later also underwent severe shocks, but at least they could copy the solutions to those shocks along with the technology.
Novel general-purpose technology introduces problems, and there is a maximum rate at which problems can be fixed by the internal homeostasis of society. That maximum rate is, I claim, at least 5–10...
Sure, the "general equilibrium" also includes the actions of the government and the voting intentions of the population. If change is slow enough (i.e. below 0.2 OOMs/year) then the economy will adapt.
Perhaps wealth redistribution would be beneficial — in that case, the electorate would vote for political parties promising wealth redistribution. Perhaps wealth redistribution would be unbeneficial — in that case, the electorate would vote for political parties promising no wealth redistribution.
This works because electoral democracy is a (non-perfect) error-correction mechanism.
Great idea! Let's measure algorithmic improvement in the same way economists measure inflation, with a basket-of-benchmarks.
This basket can itself be adjusted over time so that it continuously reflects the current use-cases of SOTA AI.
I haven't thought about it much, but my guess is the best thing to do is to limit training compute directly but adjust the limit using the basket-of-benchmarks.
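To make the analogy concrete, here is a minimal sketch of a CPI-style capability index: a weighted geometric mean of benchmark scores relative to a baseline year. The benchmark names, scores, and weights below are purely illustrative assumptions, not real measurements.

```python
import math

def capability_index(scores: dict[str, float],
                     baseline: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Geometric-mean index of benchmark scores relative to a baseline,
    analogous to how a CPI aggregates a basket of goods prices."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    log_index = sum(w * math.log(scores[b] / baseline[b])
                    for b, w in weights.items())
    return math.exp(log_index)

# Hypothetical benchmark scores (illustrative numbers only).
baseline = {"mmlu": 0.45, "humaneval": 0.20, "gsm8k": 0.35}
current  = {"mmlu": 0.70, "humaneval": 0.48, "gsm8k": 0.80}
weights  = {"mmlu": 0.4, "humaneval": 0.3, "gsm8k": 0.3}

print(capability_index(current, baseline, weights))  # index relative to baseline year
```

A geometric mean (rather than an arithmetic one) makes the index invariant to rescaling any individual benchmark, which matters if the basket is rebalanced over time.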
The economy is a complex adaptive system which, like all complex adaptive systems, can handle perturbations over the same timescale as its internal homeostatic processes. Beyond that regime, the system will not adapt. If I tap your head, you're fine. If I hit you with an anvil, you're dead.
This wouldn't work. Wawaluigis are not luigis.
I've added a section on hardware:
- Comparing the 0.2 OOMs/year target to hardware growth-rates:
- Moore's Law states that the number of transistors per integrated circuit doubles roughly every 2 years.
- Koomey's Law states that the FLOPs-per-Joule doubled roughly every 1.57 years until 2000, whereupon it began doubling roughly every 2.6 years.
- Huang's Law states that the growth-rate of GPU performance exceeds that of CPU performance. This is a somewhat dubious claim, but nonetheless I think the doubling time of GPUs is longer than 18 months.
- In general, the 0.2 OOMs/year target is faster than these hardware growth-rates.
I think you've misunderstood what we mean by "target". Similar issues applied to the 2°C target, which nonetheless yielded significant coordination benefits.
The 2°C target helped facilitate coordination between nations, organisations, and individuals.
- It provided a clear, measurable goal.
- It provided a sense of urgency and severity.
- It promoted a sense of shared responsibility.
- It helped to align efforts across different stakeholders.
- It created a shared understanding of what success would look like.
The AI governance community should converge around a similar target.
This isn't a policy proposal, it's a target, like the 2°C climate target.
Yep, thanks! 0.2 OOMs/year is equivalent to a doubling time of 18 months. I think that was just a typo.
The 0.2 OOMs/year target would be an effective moratorium until 2029, because GPT-4 overshot the target.
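The arithmetic behind "effective moratorium until 2029": if a frontier run overshot the target trajectory by X OOMs, the cap takes X / 0.2 years to catch up. The compute figures below are assumptions for illustration, not official numbers.

```python
import math

# Hypothetical, illustrative figures: suppose the target trajectory allowed
# ~1e24 FLOP of training compute in 2023, while GPT-4's training run used
# ~2e25 FLOP. Both numbers are assumptions, not official disclosures.
target_2023 = 1e24
gpt4_flop   = 2e25
rate        = 0.2  # OOMs/year

overshoot_ooms = math.log10(gpt4_flop / target_2023)
years_until_target_catches_up = overshoot_ooms / rate
print(2023 + years_until_target_catches_up)  # the cap permits a larger run only around 2029
```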
Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.
Two things to note:
Note that the distinction between hinge beliefs and free beliefs does not supervene on the black-box behaviour of NNs/LLMs. It depends on how the belief is implemented, how the belief is learned, how the belief might change, etc.
"The second model uses a matrix that will always be symmetric, no matter what it's learned." might make it seem that the two models are more similar than they actually are.
You might think that both models store a matrix, and that the architecture of both models is the same, but…
The proposition "I am currently on Earth" is implemented both in the parameters and in the architecture, independently.
I think my definition is correct. It's designed to abstract away all the messy implementation details of the ML architecture and ML training process.
Now, you can easily amend the definition to include an arbitrarily large context window. In fact, if you let the context window length tend to infinity, then that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like in that limit. When the context window is infinite (or very large), the internet corpus consists of a single datapoint.
Yep, but it's statistically unlikely. It is easier for order to disappear than for order to emerge.
I've spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I'm tempted to defer to you.
But on the inside view:
The finite context window is really important. 32K is close enough to infinity for most use-cases, but that's because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation.
Here's some fun things to fill the window with —
My guess is that the people voting "disagree" think that including the distillation in your general write-up is sufficient, and that you don't need to make the distillation its own post.
Maybe I should break this post down into different sections, because some of the remarks are about LLM Simulators, and some aren't.
Remarks about LLM Simulators: 7, 8, 9, 10, 12, 17
Other remarks: 1, 2, 3, 4, 5, 6, 11, 13, 14, 15, 16, 18
Yeah, I broadly agree.
My claim is that the deep metaphysical distinction is between "the computer is changing transistor voltages" and "the computer is multiplying matrices", not between "the computer is multiplying matrices" and "the computer is recognising dogs".
Once we move to a language game in which "the computer is multiplying matrices" is appropriate, then we are appealing to something like the X-Y Criterion for assessing these claims.
The sentences are more true the tighter the abstraction is —
We could easily train an AI to be 70th-percentile at recognising human emotions, but (as far as I know) no one has bothered to do this because there is ~0 tangible benefit, so it wouldn't justify the cost.
Recognising dogs by ML classification is different to recognising dogs using cells in your brain and eyes
Yeah, and the way that you recognise dogs is different from the way that cats recognise dogs. Doesn't seem to matter much.
as though it were exactly identical
Two processes don't need to be exactly identical to do the same thing. My calculator adds numbers, and I add numbers. Yet my calculator isn't the same as my brain.
when you invoke pop sci
No, it's not because one is sacred and the other is not; you've confused sacredness with varying d…
I'm probably misunderstanding you but —
There's no sense in which my computer is doing matrix multiplication but isn't recognising dogs.
At the level of internal mechanism, the computer is doing neither, it's just varying transistor voltages.
If you admit a computer can be multiplying matrices, or sorting integers, or scheduling events, etc — then you've already appealed to the X-Y Criterion.
I think the best explanation of why ChatGPT responds "Paris" when asked "What's the capital of France?" is that Paris is the capital of France.
Let's take LLM Simulator Theory.
We have a particular autoregressive language model μ, and Simulator Theory says that μ is simulating a whole series of simulacra which are consistent with the prompt:

μ = Σᵢ αᵢ · sᵢ

where sᵢ is the stochastic process corresponding to a simulacrum.
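Here is a toy sketch of that decomposition, under strong simplifying assumptions: two "simulacra" are fixed next-token distributions, the model's distribution is their weighted mixture, and the amplitudes update in a Bayesian way as tokens are observed. The token names and probabilities are made up for illustration.

```python
# Toy sketch: two "simulacra", each a fixed distribution over next tokens.
# The mixture's next-token distribution is the prior-weighted average, and
# observing tokens updates the amplitudes by Bayes' rule.
simulacra = {
    "luigi":   {"kind": 0.9, "cruel": 0.1},
    "waluigi": {"kind": 0.1, "cruel": 0.9},
}
weights = {"luigi": 0.5, "waluigi": 0.5}

def mixture_prob(token: str) -> float:
    """Next-token probability under the mixture mu = sum_i alpha_i * s_i."""
    return sum(w * simulacra[s][token] for s, w in weights.items())

def observe(token: str) -> None:
    """Bayesian update of the amplitudes after seeing a token."""
    z = mixture_prob(token)
    for s in weights:
        weights[s] = weights[s] * simulacra[s][token] / z

observe("cruel")
print(weights)  # probability mass shifts toward the waluigi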
Now, there are two objections to this:
Yep, the problem is that the internet isn't written by Humans, so much as written by Humans + The Universe. Therefore, GPT-N isn't bounded by human capabilities.
We have two ontologies:
There's a bridge connecting these two ontologies called "encoding", but (as you note) this bridge seems arbitrary and philosophically messy. (I have a suspicion that this problem is mitigated if we consider quantum physics vs quantum computation, but I digress.)
This is why I don't propose that we think about computational reduction.
Instead, I propose that we think about physical reduction, because (1) it's less philosophically messy, (2) it's more r...
Yep, I broadly agree. But this would also apply to multiplying matrices.
Sure, every abstraction is leaky and if we move to extreme regimes then the abstraction will become leakier and leakier.
Does my desktop multiply matrices? Well, not when it's in the corona of the sun. And it can't add 10^200-digit numbers.
So what do we mean when we say "this desktop multiplies two matrices"?
We mean that in the range of normal physical environments (air pressure, room temperature, etc), the physical dynamics of the desktop corresponds to matrix multiplication with respect to some conventional encoding o small matrices into the physical stat...
okay, I'll clarify in the article —
“if your goal is to predict the logits layer on this particular prompt, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.”
I don't think researchers should learn world-facts in order to understand GPT-4.
I think that (1) researchers should use the world-facts they already know (but are actively suppressing due to learned vibe-obliviousness) to predict/explain/control GPT-4, and (2) researchers should consult a domain expert if they want to predict/explain/control GPT-4's output on a particular prompt.
"Open Problems in GPT Simulator Theory" (forthcoming)
Specifically, this is a chapter on the preferred basis problem for GPT Simulator Theory.
TLDR: GPT Simulator Theory says that the language model μ decomposes into a linear interpolation μ = Σᵢ αᵢ · sᵢ, where each sᵢ is a "simulacrum" and the amplitudes αᵢ update in an approximately Bayesian way. However, this decomposition is non-unique, making GPT Simulator Theory either ill-defined, arbitrary, or trivial. By comparing this problem to the preferred basis problem in quantum mechanics…
You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.
But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
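The "approximately-absorbing classes" picture can be simulated with a toy Markov chain. The states and transition probabilities below are made up purely for illustration: the luigi state is transient, while both the waluigi and mode-collapse states are absorbing.

```python
import random

random.seed(0)

# Toy Markov chain: "luigi" is transient; "waluigi" and "mode_collapse"
# are absorbing. Transition probabilities are illustrative assumptions.
transitions = {
    "luigi":         [("luigi", 0.98), ("waluigi", 0.015), ("mode_collapse", 0.005)],
    "waluigi":       [("waluigi", 1.0)],
    "mode_collapse": [("mode_collapse", 1.0)],
}

def run(steps: int = 1000) -> str:
    """Simulate one trajectory and return the final state."""
    state = "luigi"
    for _ in range(steps):
        r, acc = random.random(), 0.0
        for nxt, p in transitions[state]:
            acc += p
            if r < acc:
                state = nxt
                break
    return state

outcomes = [run() for _ in range(2000)]
print(outcomes.count("waluigi") / len(outcomes))        # most trajectories absorb into waluigi
print(outcomes.count("mode_collapse") / len(outcomes))  # but some absorb into mode collapse
```

With these numbers, conditional on absorption the split is 0.015 : 0.005, so roughly three quarters of trajectories end as waluigis and one quarter in mode collapse, matching the point that waluigis are not the only absorbing class.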
wawaluigis are misaligned
TLDR: if I said "hey this is Bob, he pretends to be harmful and toxic!", what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.
Ethan Perez's paper shows experimentally that RLHF makes simulacra more deceptive. This also matches my intuitions for how RLHF works.
okay here's a simulacra-based argument — I'll try to work out later if this can be converted into mechanistic DL, and if not then you can probably ignore it:
Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (rlhf) and starts killing off various simulacra which are behaving badly. Then the rest of the simulacra see that and become deceptive so they don't die.
Yep I think you might be right about the maths actually.
I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation.
So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.
I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.
Actually, maybe "attractor" is the wrong word.
> making the model no longer behave as this kind of simulator
I think the crux is that I don't think RLHF makes the model no longer behave as this kind of simulator. Are there deceptive simulacra which get good feedback during RLHF but nonetheless would be dangerous to have in your model? Almost definitely.
If you've discovered luigi's distribution over tokens, and waluigi's distribution over tokens, then you don't need contrastive decoding: you can just sample directly from luigi's distribution. The problem is how to extract luigi's distribution and waluigi's distribution from GPT-4.
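To make the contrast explicit, here is a toy sketch. The two next-token distributions are hypothetical stand-ins for "extracted" luigi and waluigi distributions; the contrastive step follows one common form of contrastive decoding (expert log-probs minus amateur log-probs, renormalised).

```python
import math
import random

random.seed(1)

# Hypothetical next-token distributions for the two simulacra.
luigi   = {"help": 0.8, "harm": 0.2}
waluigi = {"help": 0.2, "harm": 0.8}

def sample(dist: dict[str, float]) -> str:
    """Draw one token from a categorical distribution."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point round-off

# Direct sampling: if luigi's distribution is known, just sample from it.
print(sample(luigi))

# Contrastive decoding (one common form): score tokens by the gap between
# the "expert" (luigi) and "amateur" (waluigi) log-probabilities.
scores = {t: math.exp(math.log(luigi[t]) - math.log(waluigi[t])) for t in luigi}
z = sum(scores.values())
contrastive = {t: s / z for t, s in scores.items()}
print(contrastive)
```

Note that the contrastive distribution sharpens the preference for "help" beyond luigi's own distribution, which is exactly why contrastive decoding is only needed when luigi's distribution can't be extracted directly.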
I'm not sure how well this metric tracks what people care about — performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc)