Over the past few days I've been doing a lit review of the different types of attention heads people have found and/or the metrics one can use to detect the presence of those types of heads.
Here is a rough list from my notes. Sorry for the poor formatting, but I did say it's rough!
A list of some contrarian takes I have:
People are currently predictably too worried about misuse risks
What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
ARC's MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often ver
A strange effect: I'm using a GPU in Russia right now, which doesn't have access to copilot, and so when I'm on vscode I sometimes pause expecting copilot to write stuff for me, and then when it doesn't I feel a brief amount of the same kind of sadness I feel when a close friend is far away & I miss them.
There is a mystery which many applied mathematicians have asked themselves: Why is linear algebra so over-powered?
An answer I like was given in Lloyd Trefethen's book An Applied Mathematician's Apology, in which he writes (my summary):
Everything in the real world is described fully by non-linear analysis. In order to make such systems simpler, we can linearize (differentiate) them, and use a first or second order approximation, and in order to represent them on a computer, we can discretize them, which turns analytic techniques into algebraic ones. Therefore we've turned our non-linear analysis into linear algebra.
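To make that summary concrete, here is a minimal sketch of my own (not from Trefethen's book). I picked a standard nonlinear boundary value problem, u'' + lam * exp(u) = 0 with u(0) = u(1) = 0, purely as an illustration. Discretizing on a grid turns the differential operator into a tridiagonal matrix, and linearizing with Newton's method turns each step into a linear solve. The function names (`solve_tridiagonal`, `residual`, `solve_nonlinear_bvp`) and the parameter choices are all hypothetical.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system.
    lower[i] multiplies x[i-1] in row i, upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def residual(u, lam, h):
    """Discretized nonlinear equation: second difference of u plus lam * exp(u)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0       # boundary value u(0) = 0
        right = u[i + 1] if i < n - 1 else 0.0  # boundary value u(1) = 0
        out.append((left - 2.0 * u[i] + right) / h**2 + lam * math.exp(u[i]))
    return out

def solve_nonlinear_bvp(n=50, lam=1.0, max_iters=20, tol=1e-9):
    h = 1.0 / (n + 1)
    off = 1.0 / h**2
    u = [0.0] * n
    for _ in range(max_iters):
        F = residual(u, lam, h)
        if max(abs(f) for f in F) < tol:
            break
        # Linearize: the Jacobian of the residual is tridiagonal.
        diag = [-2.0 * off + lam * math.exp(ui) for ui in u]
        delta = solve_tridiagonal([off] * n, diag, [off] * n, F)
        u = [ui - di for ui, di in zip(u, delta)]
    return u

u = solve_nonlinear_bvp()
print(max(u))
```

The only place the nonlinearity shows up is in building `diag` and the residual. All the actual work is done by the linear solver, which is roughly the point of the quote.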
Seems like every field of engineering is like:
Probably my biggest pet peeve when trying to find or verify anything on the internet nowadays is that newspapers never seem to link to or cite (in any useful manner) the primary sources they use, unless, weirdly, those primary sources come from Twitter.
There have probably been hundreds of times by now that I have seen an interesting economic or scientific claim made by The New York Times, or some other popular (or niche) newspaper, wanted to find the relevant paper, and had to spend at least 10 minutes on Google trying to search between thousands of identical newspaper articles for the one paper that actually says anything about what was actually done.
More often than not, the paper is a lot less interesting than the newspaper article is making it out to be too.
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?
An argument against: It counter-intuitively decreases the chances. Why? For the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, after the shield is put up there would be no credible threat of retaliation America's enemies would have if the US were to launch a first strike. Therefore, there would be no reason (geopolitically) for America not to launch a first strike, and there would be quite the reason to launch a first strike: namely, the shield definitely works for the present crop of ICBMs, but may not work for future ICBMs. Therefore America's enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike.
The same logic works in reverse. If Russia were building an ICBM defense shield, and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.
And the same logic wo...
I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear. The common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.
I also don't often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It's difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
For all the talk about bad incentive structures being the root of all evil in the world, EAs are, and I thought this even before the recent Altman situation, strikingly bad at setting up good organizational incentives. A document (even a founding one) with some text, a paper-wise powerful board with good people, and a general claim to do-goodery are powerless in the face of the incentives you create when making your org. What local changes will cause people to gain more money, power, status, influence, sex, or other things they selfishly & basely desire? Which of the powerful are you partnering with, and what do their incentives look like?
You don't need incentive-purity here, but for every bad incentive you have, you must put more pressure on your good people & culture to forego their base & selfish desires for high & altruistic ones, and fight against those who choose the base & selfish desires and are potentially smarter & wealthier than your good people.
Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.
Sometimes people say releasing model weights is bad because it hastens the time to AGI. Is this true?
I can see why people dislike non-centralized development of AI, since it makes it harder to control those developing the AGI. And I can even see why people don't like big labs making the weights of their AIs public due to misuse concerns (even if I think I mostly disagree).
But much of the time people are angry at non-open-sourced, centralized, AGI development efforts like Meta or X.ai (and others) releasing model weights to the public.
In neither of these cases however did the labs have any particular very interesting insight into architecture or training methodology (to my knowledge) which got released via the weight sharing, so I don't think time-to-AGI got shortened at all.
I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)
I also don't think misuse of public weights is a huge deal right now.
My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want against AI takeover infeasible to apply consistently---someone will just run the AIs without those safeguards). I think we don't know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they'll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic "yay open-source", which I'm worried makes it harder to stop releasing weights in the future.
(I'm not sure how many people I speak for here, maybe some really do think it speeds up timelines.)
Robin Hanson has been writing regularly, at about the same quality for almost 20 years. Tyler Cowen too, but personally Robin has been much more influential intellectually for me. It is actually really surprising how little his insights have degraded via return-to-the-mean effects. Anyone else like this?
Last night I had a horrible dream: That I had posted to LessWrong a post filled with useless & meaningless jargon without noticing what I was doing, then I went to sleep, and when I woke up I found I had karma on the post. When I read the post myself I noticed how meaningless the jargon was, and I myself couldn't resist giving it a strong-downvote.
Some have pointed out seemingly large amounts of status-anxiety EAs generally have. My hypothesis about what's going on:
A cynical interpretation: for most people, altruism is significantly motivated by status-seeking behavior. It should not be all that surprising if most effective altruists are motivated significantly by status in their altruism. So you've collected several hundred people all motivated by status into the same subculture, but status isn't a positive-sum good, so not everyone can get the amount of status they want, and we get the above dynamic: people get immense status anxiety compared to alternative cultures because in alternative situations they'd just climb to the proper status-level in their subculture, out-competing those who care less about status. But here, everyone cares about status to a large amount, so those who would have out-competed others in alternate situations are unable to and feel bad about it.
The solution?
One solution given this world is to break EA up into several different sub-cultures. On a less grand, more personal, scale, you could join a subculture outside EA and status-climb to your heart's content in there.
Preferably a subculture with very few status-seekers, but with large amounts of status to give. Ideas for such subcultures?
An interesting strategy, which seems related to FDT's prescription to ignore threats, which seems to have worked:
From the very beginning, the People’s Republic of China had to maneuver in a triangular relationship with the two nuclear powers, each of which was individually capable of posing a great threat and, together, were in a position to overwhelm China. Mao dealt with this endemic state of affairs by pretending it did not exist. He claimed to be impervious to nuclear threats; indeed, he developed a public posture of being willing to accept hundreds of millions of casualties, even welcoming it as a guarantee for the more rapid victory of Communist ideology. Whether Mao believed his own pronouncements on nuclear war it is impossible to say. But he clearly succeeded in making much of the rest of the world believe that he meant it—an ultimate test of credibility.
From Kissinger's On China, chapter 4 (loc 173.9).
My latest & greatest project proposal, in case people want to know what I'm doing, or give me money. There will likely be a LessWrong post up soon where I explain in a more friendly way my thoughts.
...Over the next year I propose to study the development and determination of values in RL & supervised learning agents, and to expand the experimental methods & theory of singular learning theory (a theory of supervised learning) to the reinforcement learning case.
All arguments for why we should expect AI to result in an existential risk rely on AIs ha
Since it seems to be all the rage nowadays, due to Aschenbrenner's Situational Awareness, here's a Manifold market I created on when the first (or whether any) AGI company will be "nationalized".
I would be in the never camp, unless the AI safety policy people get their way. But I don't like betting in my own markets (it makes them more difficult to judge in the case of an edge-case).
In particular, 25% chance of nationalization by EOY 2040.
I think in fast-takeoff worlds, the USG won't be fast enough to nationalize the industry, and in slow-takeoff worlds, the USG will pursue regulation on the level of military contractors of such companies, but won't nationalize them. I mainly think this because this is the way the USG usually treats military contractors (including strict & mandatory security requirements, and gatekeeping the industry), and really it's my understanding of how it treats most projects it wants to get done which it doesn't already have infrastructure in place to complete.
Nationalization, in the US, is just very rare.
Even during World War 2, my understanding is very few industries---even those vital to the war effort---were nationalized. People love talking about the Manhattan Project, but that was not an industry that was nationalized, that was a research project started by & for the government. AI is a billion-dollar industry. The AGI labs (their people, leaders, and stock-holders [or in OAI's case, their profit participation unit holders]) are not just going to sit idly by as they're taken over.
And neither may the nation...
If Adam is right, and the only way to get great at research is long periods of time with lots of mentor feedback, then MATS should probably pivot away from the 2-6 month time-scales they've been operating at, and toward 2-6 year timescales for training up their mentees.
Seems like the thing to do is to have a program that happens after MATS, not to extend MATS. I think in general you want sequential filters for talent, and ideally the early stages are as short as possible (my guess is indeed MATS should be a bit shorter).
A Theory of Usable Information Under Computational Constraints
...We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive V-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-
My reading is that their definition of conditional predictive entropy is the naive generalization of Shannon's conditional entropy, given that the way you condition on some data is restricted to implementing functions of a particular class. The corresponding generalization of mutual information then becomes a measure of how much more predictable some piece of information (Y) becomes given evidence (X), compared to no evidence.
For example, the goal of public key cryptography cannot be to make the mutual information between a plaintext, and public key & encrypted text zero, while maintaining maximal mutual information between the encrypted text and plaintext given the private key, since this is impossible.
Cryptography instead assumes everyone involved can only condition their probability distributions using polynomial time algorithms of the data they have, and in that circumstance you can minimize the predictability of your plain text after getting the public key & encrypted text, while maximizing the predictability of the plain text after getting the private key & encrypted text.
More mathematically, they assume you can only implement functions from your ...
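Here is a toy sketch of that reading, not the paper's code. I assume a very simple kind of function class: the observer may only look at the evidence X through one of a fixed list of "features", and then predicts Y with the best conditional distribution given that feature. All names (`cond_entropy_bits`, `v_information`, `single_bit`, `unrestricted`) are hypothetical, and the XOR example stands in for the cryptography case: the evidence determines Y completely, but a restricted observer cannot use it.

```python
import math
from collections import Counter
from itertools import product

def cond_entropy_bits(samples, feature):
    """H(Y | feature(X)) in bits, treating the (x, y) samples as equally likely."""
    joint = Counter((feature(x), y) for x, y in samples)
    marginal = Counter(feature(x) for x, _ in samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / marginal[z]) for (z, _), c in joint.items())

def ignore(x):
    """The 'no evidence' feature: the observer throws X away."""
    return None

def v_information(samples, features):
    """Predictability gain from X for an observer restricted to `features`.
    The observer is always allowed to ignore X, so the result is never negative."""
    no_evidence = cond_entropy_bits(samples, ignore)
    with_evidence = min(cond_entropy_bits(samples, g) for g in features + [ignore])
    return no_evidence - with_evidence

# X is two fair bits, Y is their XOR.
samples = [((a, b), a ^ b) for a, b in product([0, 1], repeat=2)]

single_bit = [lambda x: x[0], lambda x: x[1]]  # observer can read only one bit of X
unrestricted = [lambda x: x]                   # observer can use all of X

print(v_information(samples, single_bit))
print(v_information(samples, unrestricted))
```

The restricted observer gains nothing from X, while the unrestricted one recovers the full bit, even though the underlying joint distribution (and so the Shannon mutual information) is the same in both cases.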
From The Guns of August
...Old Field Marshal Moltke in 1890 foretold that the next war might last seven years—or thirty—because the resources of a modern state were so great it would not know itself to be beaten after a single military defeat and would not give up [...] It went against human nature, however—and the nature of General Staffs—to follow through the logic of his own prophecy. Amorphous and without limits, the concept of a long war could not be scientifically planned for as could the orthodox, predictable, and simple solution of decisive battle an
Yesterday I had a conversation with a person very much into cyborgism, and they told me about a particular path to impact floating around the cyborgism social network: Evals.
I really like this idea, and I have no clue how I didn't think of it myself! It's the obvious thing to do when you have a bunch of insane people (used as a term of affection & praise by me for such people) obsessed with language models, who are also incredibly good & experienced at getting the models to do whatever they want. I would trust these people red-teaming a model and te...
More evidence for in-context RL, in case you were holding out for mechanistic evidence LLMs do in-context internal-search & optimization.
...In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs' in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well-understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (R
Progress in neuromorphic value theory
...Animals perform flexible goal-directed behaviours to satisfy their basic physiological needs [1-12]. However, little is known about how unitary behaviours are chosen under conflicting needs. Here we reveal principles by which the brain resolves such conflicts between needs across time. We developed an experimental paradigm in which a hungry and thirsty mouse is given free choices between equidistant food and water. We found that mice collect need-appropriate rewards by structuring their choices into p