D0TheMath's Shortform

This is a special post for quick takes by Garrett Baker. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Over the past few days I've been doing a lit review of the different types of attention heads people have found and/or the metrics one can use to detect the presence of those types of heads. 

Here is a rough list from my notes. Sorry for the poor formatting, but I did say it's rough!

... (read more)

A list of some contrarian takes I have:

  • People are currently predictably too worried about misuse risks

  • What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.

  • Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.

  • Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.

  • Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.

  • ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.

  • The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.

  • ARC's MAD seems doomed to fail.

  • People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often ver

... (read more)
7Garrett Baker
Ah yes, another contrarian opinion I have:

  • Big AGI corporations, like Anthropic, should by default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
4Olli Järviniemi
I talked about this with Garrett; I'm unpacking the above comment and summarizing our discussions here.

  • Sleeper Agents is very much in the "learned heuristics" category, given that we are explicitly training the behavior into the model. Corollary: the underlying mechanisms for sleeper-agents behavior and instrumentally convergent deception are presumably wildly different(!), so it's not obvious how valid an inference one can make from the results.
  • Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendrycks' comment.
  • Much of existing work on deception suffers from "you told the model to be deceptive, and now it deceives, of course that happens."
    • (Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay.)
  • There is very little work on actual instrumentally convergent deception(!): a lot of work falls into the "learned heuristics" category or the failure in the previous bullet point.
  • People are prone to conflate "shallow, trained deception" (e.g. sycophancy: "you rewarded the model for leaning into the user's political biases, of course it will start leaning into users' political biases") and instrumentally convergent deception.
    • (For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)

Also, we talked a bit about how (as I interpreted Garrett) people often consider too few and too shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.

Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. "deception" includes both very shallow deception and instrumentally convergent deception).

Example 2: People generally seem to have an opinion of "chain-of-t
2Garrett Baker
I will clarify this: I think people often do causal interventions on their CoTs, but not in ways that are very convincing to me.
4Olli Järviniemi
  If you have the slack, I'd be interested in hearing/chatting more about this, as I'm working (or trying to work) on the "real" "scary" forms of deception. (E.g. do you think that this paper has the same failure mode?)
3Garrett Baker
I'd be happy to chat. Will DM so we can set something up.

On the subject of your paper, I do think it looks at a much more interesting phenomenon than, say, sleeper agents, but I'm also not fully convinced you're studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally, and I'm skeptical of that. For example, the model may think something more similar to this:

Context: Audit
Possibility 1: I must be part of an unethical company
  p1 implies: I must be an unethical worker
  Action: Activate Unethical Worker simulacra
    Unethical Worker recommends "lie"
    Unethical Worker implies: I did something wrong
    Lying recommendation implies: say "I did nothing wrong"
Possibility 2: I must be part of an ethical company
  p2 implies: I must be an ethical worker
  Action: Activate Ethical Worker simulacra
    Ethical Worker recommends "tell the truth"
    Ethical Worker implies: I did nothing wrong
    Truth recommendation implies: say "I did nothing wrong"
Conclusion: say "I did nothing wrong"

Which I don't say isn't worrying, but in terms of how it arises, and possible mitigation strategies, it is very different, and probably also an easier problem to study & solve than something like:

Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7
Conclusion: say "I did nothing wrong"
3Christopher “Chris” Upshaw
All of these seem pretty cold tea, as in true but not contrarian.
6Garrett Baker
Everyone I talk with disagrees with most of these. So maybe we just hang around different groups.

A strange effect: I'm using a GPU in Russia right now, which doesn't have access to Copilot, and so when I'm in VS Code I sometimes pause expecting Copilot to write stuff for me, and then when it doesn't I feel a brief amount of the same kind of sadness I feel when a close friend is far away & I miss them.

8avturchin
Can you access it via VPN?
7Garrett Baker
I'm ssh-ing into it. I bet there's a way, but not worth it for me to figure out (but if someone knows the way, please tell).

There is a mystery which many applied mathematicians have asked themselves: Why is linear algebra so over-powered?

An answer I like was given in Lloyd Trefethen's book An Applied Mathematician's Apology, in which he writes (my summary):

Everything in the real world is described fully by non-linear analysis. In order to make such systems simpler, we can linearize (differentiate) them, and use a first or second order approximation, and in order to represent them on a computer, we can discretize them, which turns analytic techniques into algebraic ones. Therefore we've turned our non-linear analysis into linear algebra.
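As a concrete illustration of that pipeline (my own toy example, not from the book), here is a sketch that discretizes and linearizes a nonlinear boundary-value problem so that each step of the solve is just linear algebra:

```python
# Toy sketch (my own example, not from Trefethen): the nonlinear pendulum
# boundary-value problem u'' + sin(u) = 0, u(0) = 0, u(1) = 1.
# Discretize with finite differences, then linearize with Newton's method,
# so the nonlinear analysis problem reduces to repeated linear solves.
import numpy as np

n = 50                          # interior grid points
h = 1.0 / (n + 1)               # grid spacing on [0, 1]
u = np.linspace(0.0, 1.0, n)    # initial guess for the interior values
u_left, u_right = 0.0, 1.0      # boundary conditions

def residual(u):
    # F_i = (u_{i-1} - 2 u_i + u_{i+1}) / h^2 + sin(u_i)
    full = np.concatenate(([u_left], u, [u_right]))
    return (full[:-2] - 2 * full[1:-1] + full[2:]) / h**2 + np.sin(u)

def jacobian(u):
    # Tridiagonal Jacobian of the discretized operator (the "linearize" step)
    main = np.full(n, -2.0 / h**2) + np.cos(u)
    off = np.full(n - 1, 1.0 / h**2)
    return np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

for _ in range(20):             # Newton iteration: one linear solve per step
    step = np.linalg.solve(jacobian(u), -residual(u))
    u += step
    if np.linalg.norm(step) < 1e-10:
        break

print("max residual after Newton:", np.abs(residual(u)).max())
```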

Seems like every field of engineering is like:

  • step 1: put the system in a happy state where everything is linear or maybe quadratic if you must
  • step 2: work out the diameter of the gas tube or whatever
  • step 3: cover everything in cement to make sure you never ever leave the happy state
    • if you found an efficiency improvement that uses an exponential then go sit in time out and come back when you can act like an adult
3Rana Dexsin
That description is distinctly reminiscent of the rise of containerization in software.
2Garrett Baker
Not quite. Helpful video; summary: they use a row of spinning fins mid-way along their rockets to indirectly steer missiles by creating turbulent vortices which interact with the tail-fins and add an extra oomph to the steering mechanism. The exact algorithm is classified, for obvious reasons.
3lemonhope
This is cool, I'd never heard of this. There are many other exceptions of course, particularly with "turning things on" (car starting, computer starting, etc).
2Alexander Gietelink Oldenziel
Compare also the central conceit of QM / Koopmania. Take a classical nonlinear finite-dimensional system X described by, say, a PDE. This is a dynamical system with evolution operator X -> X. Now look at the space H(X) of C/R-valued functions on the phase space of X. After completion we obtain a Hilbert space H. Now the evolution operator on X induces a map on H = H(X). We have now turned a finite-dimensional nonlinear problem into an infinite-dimensional linear problem.
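A minimal numerical sketch of this Koopman/EDMD idea (a toy example with invented details; with a finite truncation the lifted evolution is only approximately linear):

```python
# Toy Koopman / EDMD-style sketch: a nonlinear map x -> f(x) becomes
# (approximately) linear on a truncated space of observables g_k(x) = x^k,
# where one matrix K plays the role of the evolution operator.
import numpy as np

def f(x):
    return 3.7 * x * (1.0 - x)     # nonlinear logistic-type map

def lift(x, d=6):
    return np.vander(x, d + 1, increasing=True)  # observables 1, x, ..., x^d

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 2000)    # sample the state space
G, G_next = lift(x), lift(f(x))

# Fit K so that lift(f(x)) ≈ lift(x) @ K  (least squares over the samples)
K, *_ = np.linalg.lstsq(G, G_next, rcond=None)

x0 = np.array([0.3])
print(lift(x0) @ K)                # approximate lifted image of x0
print(lift(f(x0)))                 # exact lifted image, for comparison
```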

Probably my biggest pet peeve when trying to find or verify anything on the internet nowadays is that newspapers never seem to link to or cite (in any useful manner) any primary sources they use, unless, weirdly, those primary sources come from Twitter.

There have probably been hundreds of times by now that I have seen an interesting economic or scientific claim made by The New York Times, or some other popular (or niche) newspaper, wanted to find the relevant paper, and had to spend at least 10 minutes on Google sifting through thousands of identical newspaper articles for the one paper that actually says anything about what was actually done.

More often than not, the paper also turns out to be a lot less interesting than the newspaper article makes it out to be.

8RamblinDash
They also do this with court filings/rulings. The thing they do that's most annoying is that they'll have a link that looks like it should be to the filing/ruling, but when clicked it's just a link to another earlier news story on the same site, or even sometimes a link to the same page I'm already on!
1sunwillrise
Most regular readers have never (and will never) read any judicial opinion and instead rely almost entirely on the media to tell them (usually in very oversimplified, biased, and incoherent ways) what the Supreme Court held in a particular case, for example. The vast majority of people who have any interest whatsoever in reading court documents are lawyers (or police officers, paralegals, sports and music agents, bankers etc) generally accustomed to finding those opinions quickly using stuff like casetext, courtlistener, as well as probably a half dozen other paid websites laypeople like me don't even know about. The demand for linking the actual ruling or opinion is just too low for journalists to care about. As a result, stuff like courthousenews and the commentary available on the Volokh Conspiracy unsurprisingly becomes crucial for finding some higher-level insights into legal matters.
2ChristianKl
I don't think those groups of people are the only one who have an interest in being informed that's strong enough to read primary sources. 
2RamblinDash
For opinions that's right - for news stories about complaints being filed, they are sometimes not publicly available online, or the story might not have enough information to find them, e.g. what specific court they were filed in, the actual legal names of the parties, etc.
2MichaelDickens
I don't understand how not citing a source is considered acceptable practice. It seems antithetical to standard journalistic ethics.
3Viliam
Citing is good for journalistic ethics, but linking is bad for search engine optimization -- at least this is what many websites seem to believe. The idea is that a link to an external source provides PageRank to that source that you could have provided to a different page on your website instead. If anyone in the future tries to find X, as a journalist you want them to find your article about X, not X itself. Journalism is a profit-making business, not charity.
3Garrett Baker
Is it? That’s definitely what my English teacher wanted me to believe, but since every newspaper does it, all the time (except when someone Tweets something) I don’t see how it could be against journalistic ethics. Indeed, I think there’s a strong undercurrent in most mainstream newspapers that “the people” are not smart enough to evaluate primary sources directly, and need journalists & communicators to ensure they arrive at the correct conclusions.
1keltan
Unfortunately, I tend to treat any non-independent science-related media as brain poison. It tends to be much more hype or misunderstanding than value. Which is a shame, because there is so much interesting and true science that could be mined for content.
2Garrett Baker
I do the same for the most part. The way this comes up is mostly by my attempts to verify claims Wikipedia makes.
8Garrett Baker
To elaborate on @the gears to ascension's highlighted text, often Wikipedia cites newspaper articles when it makes a particular scientific, economic, historical, or other claim, instead of the relevant paper or other primary source such newspaper articles are reporting on. When I see interesting, surprising, or action-relevant claims I like checking & citing the corresponding primary source, which makes the claim easier for me to verify, often provides nuance which wasn't present in the Wikipedia or news article, and makes it more difficult for me to delude myself when talking in public (since it makes it easier for others to check the primary source, and criticize me for my simplifications or exaggerations). 

Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument against: It counter-intuitively decreases the chances. Why? For the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, once the shield is up, America's enemies would have no credible threat of retaliation if the US were to launch a first strike. Therefore there would be no geopolitical reason for America not to launch a first strike, and there would be quite a strong reason to launch one: namely, the shield definitely works for the present crop of ICBMs, but may not work for future ICBMs. Therefore America's enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike.

The same logic works in reverse. If Russia were building an ICBM defense shield, and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.

And the same logic wo... (read more)
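A toy payoff comparison for the shield argument above (all numbers invented, purely illustrative): the adversary compares striking before the shield is finished with waiting until after it is up.

```python
# Toy numbers (entirely invented) for the shield argument: an adversary
# compares striking before the shield is finished with waiting until after,
# when it expects a first strike it cannot answer.
RETALIATION_COST = -50           # cost of striking first while MAD still holds
UNANSWERABLE_STRIKE_COST = -100  # cost of waiting, then being struck with no retaliation
STATUS_QUO = 0                   # cost of waiting when no shield is coming

def adversary_best_move(shield_will_complete: bool) -> str:
    strike_now = RETALIATION_COST
    wait = UNANSWERABLE_STRIKE_COST if shield_will_complete else STATUS_QUO
    return "strike pre-emptively" if strike_now > wait else "wait"

print(adversary_best_move(shield_will_complete=False))  # wait
print(adversary_best_move(shield_will_complete=True))   # strike pre-emptively
```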

I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear. The common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.

For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.

I also don't often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It's difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
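To make the kind of estimate I'm asking for concrete, here is a toy sketch (every number invented) of how one might compare the time-wise differentials for a single decision:

```python
# Toy Monte Carlo sketch (all numbers invented) of the comparison described
# above: does a given action shift time-to-alignment more than time-to-doom?
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Invented priors over how many years the action shaves off each clock.
years_saved_on_alignment = rng.normal(loc=0.5, scale=0.3, size=n)
years_saved_on_doom = rng.normal(loc=0.1, scale=0.2, size=n)

net_lead_gained = years_saved_on_alignment - years_saved_on_doom
print("mean net alignment lead gained (years):", net_lead_gained.mean())
print("P(action is net negative):", (net_lead_gained < 0).mean())
```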

4Nathan Helm-Burger
Yeah, I agree that releasing open-weights non-frontier models doesn't seem like a frontier capabilities advance. It does seem potentially like an open-source capabilities advance. That can be bad in different ways. Let me pose a couple hypotheticals.

1. What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it's a dangerous thing to have open-weight models catching up.

2. What if there is some threshold beyond which a model would be capable enough of recursive self-improvement with sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course, especially if they weren't sure they could trust the new model design created by the current AI. They would likely move slowly and cautiously, at least. I would not expect this of the open-source community. They seem focused on pushing the boundaries of agent scaffolding and incautiously exploring whatever they can.

So, as we get closer to danger, open-weight models take on more safety significance.
6Garrett Baker
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations. Obviously such numbers aren't the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance. If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn't exactly my main wheelhouse).
2the gears to ascension
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe - eg, because they don't understand in detail how to reason about whether it is or not - are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don't have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren't. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don't personally see how it's exfohaz. And it won't be apparent until afterwards that it was capabilities, not alignment. So just don't publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god's sake stop accidentally helping people create green nodes because you can't see five inches ahead. And don't send it to a capabilities team before it's able to guarantee moral alignment hard enough to make a red-proof yellow node!
4Garrett Baker
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual. The sort of reasoning you use here, and that my only response to it basically amounts to "well, no I think you're wrong. This proposal will slow down alignment too much" is why I think we need numbers to ground us.

For all the talk about bad incentive structures being the root of all evil in the world, EAs are (and I thought this even before the recent Altman situation) strikingly bad at setting up good organizational incentives. A document (even a founding one) with some text, a board with good people that is powerful on paper, a general claim to do-goodery: these are powerless in the face of the incentives you create when making your org. What local changes will cause people to gain more money, power, status, influence, sex, or other things they selfishly & basely desire? Which of the powerful are you partnering with, and what do their incentives look like?

You don't need incentive-purity here, but for every bad incentive you have, you must put more pressure on your good people & culture to forego their base & selfish desires for high & altruistic ones, and fight against those who choose the base & selfish desires and are potentially smarter & wealthier than your good people.

4Dagon
Can you give some examples of organizations larger than a few dozen people, needing significant resources, with goals not aligned with wealth and power, which have good organizational incentives?   I don't disagree that incentives matter, but I don't see that there's any way to radically change incentives without pretty structural changes across large swaths of society.
1lemonhope
This is a great question. I can't think of a good answer. Surely someone has done it on a large scale...
0Garrett Baker
Nvidia, for example, has 26k employees, all incentivized to produce & sell marginally better GPUs, and possibly to sabotage others' abilities to make and sell marginally better GPUs. They're likely incentivized to do other things as well, like play politics, or spin off irrelevant side-projects. But for the most part I claim they end up contributing to producing marginally better GPUs. You may complain that each individual in Nvidia is likely mostly chasing base desires, and so is actually aligned with wealth & power, and it just so happens that in the situation they're in, the best way of doing that is to make marginally better GPUs. But this is just my point! What you want is to position your company, culture, infrastructure, and friends such that the way for individuals to achieve wealth and power is to do good on your company's goal. I claim it's in nobody's interest & ability in or around Nvidia to make it produce marginally worse GPUs, or sabotage the company so that it instead goes all in on the TV business rather than the marginally better GPUs business. Edit: Look at most any large company achieving consistent outcomes, and I claim it's in everyone in that company's interest or ability to help that company achieve those consistent outcomes.
4Dagon
I'm confused.  NVidia (and most profit-seeking corporations) are reasonably aligned WRT incentives, because those are the incentives of the world around them. I'm looking for examples of things like EA orgs, which have goals very different from standard capitalist structures, and how they can set up "good incentives" within this overall framework.   If there are no such examples, your complaint about 'strikingly bad at setting up good organizational incentives" is hard to understand.  It may be more that the ENVIRONMENT in which they exist has competing incentives and orgs have no choice but to work within that.
0Garrett Baker
You must misunderstand me. To what you say, I say that you don't want your org to be fighting the incentives of the environment around it. You want to set up your org in a position in the environment where the incentives within the org correlate with doing good. If the founders of Nvidia didn't want marginally better GPUs to be made, then they hired the wrong people, bought the wrong infrastructure, partnered with the wrong companies, and overall made the wrong organizational incentive structure for that job. I would in fact be surprised if there were >1k-worker-sized orgs which consistently didn't reward their workers for doing good according to the org's values, served no demand present in the market, and yet competently executed some altruistic goal. Right now I feel like I'm just saying a bunch of obvious things which you should definitely agree with, yet you believe we have a disagreement. I do not understand what you think I'm saying. Maybe you could try restating what I originally said in your own words?
4Dagon
We absolutely agree that incentives matter. Where I think we disagree is on how much they matter and how controllable they are, especially for orgs whose goals are orthogonal to, or even contradictory with, the common cultural and environmental incentives outside of the org. I'm mostly reacting to your topic sentence, and wondering if 'strikingly bad' is relative to some EA or non-profit-driven org that does it well, or if 'strikingly bad' is just an acknowledgement that it may not be possible to do well.
2Garrett Baker
By strikingly bad I mean there are easy changes EA can make to make its sponsored orgs have better incentives, and it has too much confidence that the incentives in the orgs it sponsors favor doing good above doing bad, politics, not doing anything, etc. For example, nobody in Anthropic gets paid more if they follow their RSP and less if they don't. Changing this isn't sufficient for me to feel happy with Anthropic, but it's one example among many for which Anthropic could be better. When I think of an Anthropic I feel happy with, I think of a formally defined balance-of-powers type situation with strong & public whistleblower protection and post-whistleblower reform processes, them hiring engineers loyal to that process (rather than to building AGI), and them diversifying the sources they trade with, such that it's in none of their sources' interest to manipulate them. I also claim marginal movements toward this target are often good. As I said in the original shortform, I also think incentives are not all or nothing. Worse incentives just mean you need more upstanding workers & leaders.

Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.

2Garrett Baker
Look at that! People have used interpretability to make a mesa layer! https://arxiv.org/pdf/2309.05858.pdf
6Thomas Kwa
This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.
2Garrett Baker
Good point! Overall I don't anticipate these layers will give you much control over what the network ends up optimizing for, but I don't fully understand them yet either, so maybe you're right. Do you have specific reason to think modifying the layers will easily let you control the high-level behavior, or is it just a justified hunch?
4Thomas Kwa
Not in isolation, but that's just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer such that in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties. I haven't read the paper and I'm not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim, the paper says that the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model's mesa-optimization algorithms operate differently, our technique might not work until we understand this and adapt the technique. Or we can edit the internal training set directly rather than trying to indirectly influence it. 
3Garrett Baker
Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).
2Garrett Baker
@TurnTrout @cfoster0 you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic's interp team has been doing.
4TurnTrout
As Garrett says -- not clear that this work is net negative. Skeptical that it's strongly net negative. Haven't read deeply, though.
1Stephen Fowler
Very strong upvote. This also deeply concerns me. 
1cfoster0
Would you mind chatting about why you predict this? (Perhaps over Discord DMs)
1Garrett Baker
Not at all. Preferably tomorrow though. The basic sketch, if you want to derive this yourself, would be that mechanistic interpretability seems unlikely to mature enough as a field that I can point at particular alignment-relevant high-level structures in models which I wasn't initially looking for. I anticipate it will only get to the point of being able to provide some amount of insight into why your model isn't working correctly (this seems like a bottleneck to RL progress---not knowing why your perfectly reasonable setup isn't working) so you can fix it, but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant. Part of this is that current MI folk don't even seem to track this as the end-goal of what they should be working on, so (I anticipate) they'll just be following local gradients of impressiveness, which mostly leads towards doing capabilities-relevant work.
2TurnTrout
Aren't RL tuning problems usually because of algorithmic mis-implementation, and not models learning incorrect things? Required to be alignment relevant? Wouldn't the insight be alignment relevant if you "just" knew what the formed values are to begin with?
1Garrett Baker
I'm imagining a thing where you have little idea what's wrong with your code, so you do MI on your model and can differentiate between the worlds:

1. You're doing literally nothing. Something's wrong with the gradient updates.
2. You're doing something, but not the right thing. Something's wrong with code-section x. (With more specific knowledge about what model internals look like, this should be possible.)
3. You're doing something, but it causes your agent to be suboptimal because of learned representation y.

I don't think this route is especially likely; the point is I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught, given capabilities are easier than alignment and there are more capabilities workers than alignment workers.

Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.
1Garrett Baker
More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from 'it is sufficient to track the alignment milestone').
1Garrett Baker
Paper that uses major mechanistic interpretability work to improve capabilities of models: https://arxiv.org/pdf/2212.14052.pdf I know of no paper which uses mechanistic interpretability work to improve the safety of models, and I expect anything people link me to will be something I don't think will generalize to a worrying AGI.
5TurnTrout
I think a bunch of alignment value will/should come from understanding how models work internally -- adjudicating between theories like "unitary mesa objectives" and "shards" and "simulators" or whatever -- which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment. But, we're just going to die in alignment-hard worlds if we don't do anything, and it seems implausible that we can solve alignment in alignment-hard worlds by not understanding internals or inductive biases but instead relying on shallowly observable in/out behavior. E.g. I don't think loss function gymnastics will help you in those worlds. Credence: 75% that you have to know something real about how loss provides cognitive updates. So in those worlds, it comes down to questions of "are you getting the most relevant understanding per unit time", and not "are you possibly advancing capabilities." And, yes, often motivated reasoning will whisper the former when you're really doing the latter. That doesn't change the truth of the first sentence.
1Garrett Baker
I agree with this. I think people are bad at running that calculation, and at consciously turning down status in general, so I advocate for this position because I think it's basically true for many. Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify; it's wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else. There's a case to be made for exploration, but the rules of the game get wonky when you're trying to do differential technological development. There is strategically relevant information you want to not know.
3mesaoptimizer
I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the 'values' we want it to have. I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely for there to be 'alignment generalization' in a way that we intend, by default. So the most likely outcome of more MI research does seem to be interventions that remove the obstacles that come in the way of achieving AGI, while not actually making progress on 'alignment generalization'.
2Garrett Baker
Indeed, this is what I mean.

Sometimes people say releasing model weights is bad because it hastens the time to AGI. Is this true?

I can see why people dislike non-centralized development of AI, since it makes it harder to control those developing the AGI. And I can even see why people don't like big labs making the weights of their AIs public due to misuse concerns (even if I think I mostly disagree).

But much of the time people are angry at non-open-sourced, centralized, AGI development efforts like Meta or X.ai (and others) releasing model weights to the public.

In neither of these cases, however, did the labs (to my knowledge) have any particularly interesting insight into architecture or training methodology which got released via the weight sharing, so I don't think time-to-AGI was shortened at all.

I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)

I also don't think misuse of public weights is a huge deal right now.

My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want against AI takeover infeasible to apply consistently---someone will just run the AIs without those safeguards). I think we don't know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they'll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic "yay open-source", which I'm worried makes it harder to stop releasing weights in the future.

(I'm not sure how many people I speak for here, maybe some really do think it speeds up timelines.)

3utilistrutil
Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

Open source --> more capabilities R&D --> more profitable applications --> more profit/investment --> shorter timelines

  • The example I've heard cited is Stable Diffusion leading to LoRA.

There's a countervailing effect of democratizing safety research, which one might think outweighs because it's so much more neglected than capabilities, with more low-hanging fruit.
5Erik Jenner
By "those effects" I meant a collection of indirect "release weights → capability landscape changes" effects in general, not just hype/investment. And by "sign" I meant whether those effects taken together are good or bad. Sorry, I realize that wasn't very clear. As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff. I agree that releasing weights probably increases hype and investment if anything. I also think that right now, democratizing safety research probably outweighs all those concerns, which is why I'm mainly worried about Meta etc. not having very clear (and reasonable) decision criteria for when they'll stop releasing weights.
5Garrett Baker
I take this argument very seriously. It in fact does seem the case that very much of the safety research I'm excited about happens on open source models. Perhaps I'm more plugged into the AI safety research landscape than the capabilities research landscape? Nonetheless, I think not even considering low-hanging-fruit effects, there's a big reason to believe open sourcing your model will have disproportionate safety gains: Capabilities research is about how to train your models to be better, but the overall sub-goal of safety research right now seems to be how to verify properties of your model. Certainly framed like this, releasing the end-states of training (or possibly even training checkpoints) seems better suited to the safety research strategy than the capabilities research strategy.
7johnswentworth
The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)
2Garrett Baker
Sounds like a position someone could hold, and I guess it would make sense why those with such beliefs wouldn’t say the why too loud. But this seems unlikely. Is this really the reason so many are afraid?
2Vladimir_Nesov
I don't get the impression that very many are afraid of the direct effects of open sourcing current models. The impression that many in AI safety are afraid of specifically that is a major focus of ridicule from people who didn't bother to investigate, and a reason to not bother to investigate. Possibly this alone fuels the meme sufficiently to keep it alive.
2Garrett Baker
Sorry, I don't understand your comment. Can you rephrase?
4Vladimir_Nesov
I regularly encounter the impression that AI safety people are significantly afraid about direct consequences of open sourcing current models, from those who don't understand the actual concerns. I don't particularly see it from those who do. This (from what I can tell, false) impression seems to be one of relatively few major memes that keep people from bothering to investigate. I hypothesize that this dynamic of ridiculing of AI safety with such memes is what keeps them alive, instead of there being significant truth to them keeping them alive.
4Garrett Baker
To be clear: The mechanism you're hypothesizing is:

1. Critics say "AI alignment is dumb because you want to ban open source AI!"
2. Naive supporters read this, believe the claim that AI alignment-ers want to ban open sourcing AI, and think 'AI alignment is not dumb, therefore open sourcing AI must be bad'. When the next weight release happens they say "This is bad! Open sourcing weights is bad and should be banned!"
3. Naive supporters read other naive supporters saying this, and believe it themselves. Wise supporters try to explain no, but are either labeled as critics, or seen as weird & ignored.
4. Thus, a groupthink is born. Perhaps some wise critics "defer to the community" on the subject.
2Vladimir_Nesov
I don't think there is a significant confused-naive-supporter source of the meme that gives it teeth. It's more that reasonable people who are not any sort of supporters of AI safety propagate this idea, on the grounds that it illustrates the way AI safety is not just dumb, but also dangerous, and therefore worth warning others about. From the supporter side, "Open Model Weights are Unsafe and Nothing Can Fix This" is a shorter and more convenient way of gesturing at the concern, and convenience is the main force in the Universe that determines all that actually happens in practice. On a naive reading, such gesturing centrally supports the meme. This doesn't require the source of such support to have a misconception or to oppose publishing open weights of current models on the grounds of direct consequences.
4Chris_Leong
Doesn't releasing the weights inherently involve releasing the architecture (unless you're using some kind of encrypted ML)? A closed-source model could release the architecture details as well, but one step at a time. Just to be clear, I'm trying to push things towards a policy that makes sense going forward and so even if what you said about not providing any interesting architectural insight is true, I still think we need to push these groups to defining a point at which they're going to stop releasing open models.
4Matt Goldenberg
The classic effect of open sourcing is to hasten the commoditization and standardization of the component, which then allows an explosion of innovation on top of that stable base. If you look at what's happened with Stable Diffusion, this is exactly what we see. While it's never been a cutting edge model (until soon with SD3), there's been an explosion of capabilities advances in image model generation from it: ControlNet, best practices for LoRA training, model merging, techniques for consistent characters and animation, all coming out of the open source community. In LLM land, though not as drastic, we see similar things happening, in particular techniques for merging models to get rapid capability advances, and rapid creation of new patterns for agent interactions and tool use. So while the models themselves might not be state of the art, open sourcing the models obviously pushes the state of the art.
2Garrett Baker
The biggest effect open sourcing LLMs seems to have is improving safety techniques. Why think this differentially accelerates capabilities over safety?
7Matt Goldenberg
it doesn't seem like that's the case to me - but even if it were the case, isn't that moving the goal posts of the original post?
2Garrett Baker
You are right, but I guess the thing I do actually care about here is the magnitude of the advancement (which is relevant for determining the sign of the action). How large an effect do you think the model merging stuff has (I'm thinking of the effect where, if you train a bunch of models then average their weights, they do better)? It seems very likely to me that it's essentially zero, but I do admit there's a small negative tail that's greater than the positive, so the average is likely negative. As for agent interactions, all the (useful) advances there seem to be things that definitely would have been made even if nobody released any LLMs, and everything was APIs.
2Matt Goldenberg
It's true, but I don't think there's anything fundamental preventing the same sort of proliferation and advances in open source LLMs that we've seen in Stable Diffusion (aside from the fact that LLMs aren't as useful for porn). That it has been relatively tame so far doesn't change the basic pattern of how open source affects the growth of technology.
1Shankar Sivarajan
I'll believe it when I see it. The man who said it would be an open release has just ~~been fired~~ stepped down as CEO.
2Matt Goldenberg
yeah, it's much less likely now
4JBlack
I don't particularly care about any recent or very near future release of model weights in itself. I do very much care about the policy that says releasing model weights is a good idea, because doing so bypasses every plausible AI safety model (safety in the notkilleveryone sense) and future models are unlikely to be as incompetent as current ones.

Robin Hanson has been writing regularly, at about the same quality, for almost 20 years. Tyler Cowen too, but personally Robin has been much more influential intellectually for me. It is actually really surprising how little his insights have degraded via regression-to-the-mean effects. Anyone else like this?

5ryan_greenblatt
IMO robin is quite repetitive (even relative to other blogs like Scott Alexander's blog). So the quality is maybe the same, but the marginal value add seems to me to be substantially degrading.
2Garrett Baker
I think that his insights are very repetitive, but the application of them is very diverse, and few besides him feel comfortable or able to apply them. And this is what allows him to maintain similar quality for almost 20 years. Scott Alexander is not like this: his insights are diverse, but their applications not so much, which means he's degrading from his high. (I also think he's just a damn good writer, which also degrades to the mean. Robin was never a good writer.)
4Morpheus
Not exactly what you were looking for, but recently I noticed that there were a bunch of John Wentworth's posts that I had been missing out on that he wrote over the past 6 years. So if you get a lot out of them too, I recommend just sorting by 'old'. I really liked don't get distracted by the boilerplate (The first example made something click about math for me that hadn't clicked before, which would have helped me to engage with some “boilerplate” in a more productive way.). I also liked constraints and slackness, but I didn't go beyond the first exercise yet. There's also more technical posts that I didn't have the time to dig into yet. bhauth doesn't have as long a track record, but I got some interesting ideas from his blog which aren't on his lesswrong account. I really liked proposed future economies and the legibility bottleneck.

Last night I had a horrible dream: that I had posted to LessWrong a post filled with useless & meaningless jargon without noticing what I was doing, then I went to sleep, and when I woke up I found I had karma on the post. When I read the post myself I noticed how meaningless the jargon was, and I myself couldn't resist giving it a strong-downvote.

Some have pointed out seemingly large amounts of status-anxiety EAs generally have. My hypothesis about what's going on:

A cynical interpretation: for most people, altruism is significantly motivated by status-seeking behavior. It should not be all that surprising if most effective altruists are motivated significantly by status in their altruism. So you've collected several hundred people all motivated by status into the same subculture, but status isn't a positive-sum good, so not everyone can get the amount of status they want, and we get the above dynamic: people get immense status anxiety compared to alternative cultures because in alternative situations they'd just climb to the proper status-level in their subculture, out-competing those who care less about status. But here, everyone cares about status to a large amount, so those who would have out-competed others in alternate situations are unable to and feel bad about it.

The solution?

One solution given this world is to break EA up into several different sub-cultures. On a less grand, more personal, scale, you could join a subculture outside EA and status-climb to your heart's content in there.

Preferably a subculture with very few status-seekers, but with large amounts of status to give. Ideas for such subcultures?

An interesting strategy, which seems related to FDT's prescription to ignore threats, which seems to have worked:

From the very beginning, the People’s Republic of China had to maneuver in a triangular relationship with the two nuclear powers, each of which was individually capable of posing a great threat and, together, were in a position to overwhelm China. Mao dealt with this endemic state of affairs by pretending it did not exist. He claimed to be impervious to nuclear threats; indeed, he developed a public posture of being willing to accept hundreds of millions of casualties, even welcoming it as a guarantee for the more rapid victory of Communist ideology. Whether Mao believed his own pronouncements on nuclear war it is impossible to say. But he clearly succeeded in making much of the rest of the world believe that he meant it—an ultimate test of credibility.

From Kissinger's On China, chapter 4 (loc 173.9).

8Vladimir_Nesov
FDT doesn't unconditionally prescribe ignoring threats. The idea of ignoring threats has merit, but FDT specifically only points out that ignoring a threat sometimes has the effect of the threat (or other threats) not getting made (even if only counterfactually). Which is not always the case. Consider a ThreatBot that always makes threats (and follows through on them), regardless of whether you ignore them. If you ignore ThreatBot's threats, you are worse off. On the other hand, there might be a prior ThreatBotMaker that decides whether to make a ThreatBot depending on whether you ignore ThreatBot's threats. What FDT prescribes in this case is not directly ignoring ThreatBot's threats, but rather taking notice of ThreatBotMaker's behavior, namely that it won't make a ThreatBot if you ignore ThreatBot's threats. This argument only goes through when there is/was a ThreatBotMaker, it doesn't work if there is only a ThreatBot. If a ThreatBot appears through some process that doesn't respond to your decision to respond to ThreatBot's threats, then FDT prescribes responding to ThreatBot's threats. But also if something (else) makes threats depending on your reputation for responding to threats, then responding to even an unconditionally manifesting ThreatBot's threats is not recommended by FDT. Not directly as a recommendation to ignore something, rather as a consequence of taking notice of the process that responds to your having a reputation of not responding to any threats. Similarly with stances where you merely claim that you won't respond to threats.
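A toy sketch of that distinction, with payoff numbers that are purely illustrative (my own invention): ignoring threats only pays when there is a maker whose decision to create the ThreatBot depends on your policy.

```python
# Toy payoffs (invented) for the ThreatBot vs ThreatBotMaker cases above.
THREAT_EXECUTED = -10   # you ignore a threat and it is carried out
GIVE_IN = -3            # you comply with the threat
NO_THREAT = 0           # no threat is ever made

def payoff_threatbot_only(policy):
    # ThreatBot exists regardless of your policy and always follows through.
    return THREAT_EXECUTED if policy == "ignore" else GIVE_IN

def payoff_threatbotmaker(policy):
    # The maker only builds a ThreatBot if threatening you actually works.
    return GIVE_IN if policy == "give_in" else NO_THREAT

for scenario in (payoff_threatbot_only, payoff_threatbotmaker):
    print(scenario.__name__,
          {p: scenario(p) for p in ("ignore", "give_in")})
# ThreatBot only:  ignore = -10, give_in = -3  -> respond to the threat
# ThreatBotMaker:  ignore =   0, give_in = -3  -> ignore threats
```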
2Garrett Baker
China under Mao definitely seemed to do more than say they won't respond to threats. Thus the Korean War, during which notably no nuclear threats were made, proving conventional war was still possible in a post-nuclear world. For practical decisions, I don't think ThreatBots actually exist for states, except in the form of natural disasters. Mao's China was not good at handling natural disasters, but probably because Mao was a Marxist and Legalist, not because he conspicuously ignored them. When his subordinates made mistakes which let him know something was going wrong in their province, I think he would punish the subordinate and try to fix it.
5JesseClifton
I don't think FDT has anything to do with purely causal interactions. Insofar as threats were actually deterred here this can be understood in standard causal game theory terms.  (I.e., you claim in a convincing manner that you won't give in -> People assign high probability to you being serious -> Standard EV calculation says not to commit to threat against you.) Also see this post.
2Garrett Baker
Thus why I said related. Nobody was doing any mind-reading of course, but the principles still apply, since people are often actually quite good at reading each other.
2JesseClifton
What principles? It doesn’t seem like there’s anything more at work here than “Humans sometimes become more confident that other humans will follow through on their commitments if they, e.g., repeatedly say they’ll follow through”. I don’t see what that has to do with FDT, more than any other decision theory.  If the idea is that Mao’s forming the intention is supposed to have logically-caused his adversaries to update on his intention, that just seems wrong (see this section of the mentioned post). (Separately I’m not sure what this has to do with not giving into threats in particular, as opposed to preemptive commitment in general. Why were Mao’s adversaries not able to coerce him by committing to nuclear threats, using the same principles? See this section of the mentioned post.)     
3Garrett Baker
Far more interesting, and probably effective, than the boring classical game theory doctrine of MAD, and even Schelling's doctrine of strategic irrationality!
2Garrett Baker
The book says this strategy worked for similar reasons as the strategy in the story The Romance of the Three Kingdoms. But Mao obviously wasn't fooling anyone about China's military might!

My latest & greatest project proposal, in case people want to know what I'm doing, or want to give me money. There will likely be a LessWrong post up soon where I explain my thoughts in a more friendly way.

Over the next year I propose to study the development and determination of values in RL & supervised learning agents, and to expand the experimental methods & theory of singular learning theory (a theory of supervised learning) to the reinforcement learning case.

All arguments for why we should expect AI to result in an existential risk rely on AIs ha

... (read more)
2Garrett Baker
And here is that post

Since it seems to be all the rage nowadays, due to Aschenbrenner's Situational Awareness, here's a Manifold market I created on when the first (or whether any) AGI company will be "nationalized".

I would be in the never camp, unless the AI safety policy people get their way. But I don't like betting in my own markets (it makes them more difficult to judge in the case of an edge-case).

4Alexander Gietelink Oldenziel
Never? That's quite a bold prediction. Seems more likely than not that AI companies will be effectively nationalized. I'm curious why you think it will never happen.

In particular, 25% chance of nationalization by EOY 2040.

I think in fast-takeoff worlds, the USG won't be fast enough to nationalize the industry, and in slow-takeoff worlds, the USG will pursue regulation of such companies on the level of military contractors, but won't nationalize them. I mainly think this because this is the way the USG usually treats military contractors (including strict & mandatory security requirements, and gatekeeping the industry), and really it's my understanding of how it treats most projects it wants to get done for which it doesn't already have infrastructure in place to complete.

Nationalization, in the US, is just very rare. 

Even during World War 2, my understanding is that very few industries---even those vital to the war effort---were nationalized. People love talking about the Manhattan Project, but that was not an industry that was nationalized; that was a research project started by & for the government. AI is a billion-dollar industry. The AGI labs (their people, leaders, and stock-holders [or in OAI's case, their profit participation unit holders]) are not just going to sit idly by as they're taken over.

And neither may the nation... (read more)

3Andrew Burns
This. Very much. Truman tried to nationalize steel companies on the basis of national security to get around a strike. Was badly benchslapped.

If Adam is right, and the only way to get great at research is long periods of time with lots of mentor feedback, then MATS should probably pivot away from the 2-6 month time-scales they've been operating at, and toward 2-6 year timescales for training up their mentees.

habryka

Seems like the thing to do is to have a program that happens after MATS, not to extend MATS. I think in-general you want sequential filters for talent, and ideally the early stages are as short as possible (my guess is indeed MATS should be a bit shorter).

2Garrett Baker
Seems dependent on how much economies of scale matter here. Given the main costs (other than paying people) are ops and relationships (between MATS and the community, mentors, funders, and mentees), I think it's pretty possible the efficient move is for MATS to get into this niche.
2Thomas Kwa
Who is Adam? Is this FAR AI CEO Adam Gleave?
2Garrett Baker
Yes
1Joseph Miller
Yes, Garrett is referring to this post: https://www.lesswrong.com/posts/yi7shfo6YfhDEYizA/more-people-getting-into-ai-safety-should-do-a-phd
2Garrett Baker
Of course, it would then be more difficult for them to find mentors, mentees, and money. But if all of those scale down similarly, then there should be no problem.

A Theory of Usable Information Under Computational Constraints

We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive V-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-

... (read more)
4Alexander Gietelink Oldenziel
Can somebody explain to me what's happening in this paper?

My reading is that their definition of conditional predictive entropy is the naive generalization of Shannon's conditional entropy when the way you condition on data is restricted to implementing only functions from a particular class. The corresponding generalization of mutual information then becomes a measure of how much more predictable some piece of information (Y) becomes given evidence (X), compared to no evidence.

For example, the goal of public key cryptography cannot be to make the mutual information between the plaintext and the public key & encrypted text zero, while maintaining maximal mutual information between the encrypted text and the plaintext given the private key, since this is impossible.

Cryptography instead assumes everyone involved can only condition their probability distributions using polynomial time algorithms of the data they have, and in that circumstance you can minimize the predictability of your plain text after getting the public key & encrypted text, while maximizing the predictability of the plain text after getting the private key & encrypted text.

More mathematically, they assume you can only implement functions from your ... (read more)
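For reference, here is my paraphrase of the paper's central definitions (notation approximate and reconstructed from memory, so treat it as a sketch):

$$\mathcal{H}_{\mathcal{V}}(Y \mid X) \;=\; \inf_{f \in \mathcal{V}} \ \mathbb{E}_{x, y \sim X, Y}\big[-\log f[x](y)\big], \qquad I_{\mathcal{V}}(X \to Y) \;=\; \mathcal{H}_{\mathcal{V}}(Y \mid \varnothing) \;-\; \mathcal{H}_{\mathcal{V}}(Y \mid X)$$

where $\mathcal{V}$ is the allowed "predictive family" of conditioning functions (e.g. polynomial-time algorithms in the cryptography example), each $f[x]$ is a probability distribution over $Y$, and $\varnothing$ denotes conditioning on no side information. When $\mathcal{V}$ contains all functions this recovers Shannon's quantities; restricting $\mathcal{V}$ is exactly what lets the data processing inequality fail.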


From The Guns of August

Old Field Marshal Moltke in 1890 foretold that the next war might last seven years—or thirty—because the resources of a modern state were so great it would not know itself to be beaten after a single military defeat and would not give up [...] It went against human nature, however—and the nature of General Staffs—to follow through the logic of his own prophecy. Amorphous and without limits, the concept of a long war could not be scientifically planned for as could the orthodox, predictable, and simple solution of decisive battle an

... (read more)

Yesterday I had a conversation with a person very much into cyborgism, and they told me about a particular path to impact floating around the cyborgism social network: Evals.

I really like this idea, and I have no clue how I didn't think of it myself! It's the obvious thing to do when you have a bunch of insane people (used as a term of affection & praise by me for such people) obsessed with language models, who are also incredibly good & experienced at getting the models to do whatever they want. I would trust these people red-teaming a model and te... (read more)

3NicholasKees
@janus wrote a little bit about this in the final section here, particularly referencing the detection of situational awareness as a thing cyborgs might contribute to. It seems like a fairly straightforward thing to say that you would want the people overseeing AI systems to also be the ones who have the most direct experience interacting with them, especially for noticing anomalous behavior.
2Garrett Baker
I just reread that section, and I think I didn't recognize it the first time because I wasn't thinking "what concrete actions is Janus implicitly advocating for here". Though maybe I just have worse than average reading comprehension.
3mesaoptimizer
I have no idea if this is intended to be read as irony or not, and the ambiguity is delicious.
2Garrett Baker
There now exist two worlds I must glomarize between. In the first, the irony is intentional, and I say "wouldn't you like to know". In the second, it's not: "Irony? What irony!? I have no clue what you're talking about".
2jacquesthibs
I think many people focus on doing research that focuses on full automation, but I think it's worth trying to think in the semi-automated frame as well when trying to come up with a path to impact. Obviously, it isn't scalable, but it may be more sufficient than we'd think by default for a while. In other words, cyborgism-enjoyers might be especially interested in those kinds of evals, capability measurements that are harder to pull out of the model through traditional evals, but easier to measure through some semi-automated setup.

More evidence for in-context RL, in case you were holding out for mechanistic evidence LLMs do in-context internal-search & optimization.

In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs' in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well-understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (R

... (read more)

Progress in neuromorphic value theory

Animals perform flexible goal-directed behaviours to satisfy their basic physiological needs. However, little is known about how unitary behaviours are chosen under conflicting needs. Here we reveal principles by which the brain resolves such conflicts between needs across time. We developed an experimental paradigm in which a hungry and thirsty mouse is given free choices between equidistant food and water. We found that mice collect need-appropriate rewards by structuring their choices into p

... (read more)
2Garrett Baker
Seems also of use to @Quintin Pope