Olli Järviniemi

Quite a few of these seem very hard to do in a single forward pass.

 

Current models definitely don't seem capable of doing this in a single forward pass. But we are possibly not far away from models being capable of doing basic forms of this reasoning with CoT (possibly we are already there), and then one needs to get quantitative: How authentic-looking are the red-teaming examples humans can generate? For performance reasons, we are probably giving Weak a CoT - how long is it? (We could monitor Weak's CoT, though then you have more complexity, and Weak isn't really trustworthy if you need to do that.) And it's probably very easy for Weak to figure out that it's being used as a monitor - its prompt literally contains instructions for what to report and what not to.

Overall this is messy, and someone needs to actually sit down and think through these arguments and then run the evaluations. ("Thinking through these arguments" is largely what I'm doing nowadays.) So while I am more optimistic about "scheming-risks" than I have been before, there's work to be done here.

Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).

We definitely want to do some monitoring on internals (e.g. "coup probes"). The worry is that it's hard to get confidence that such methods work right when a model is actually defecting, if one doesn't in advance have any examples of the models doing that and has to go with synthetic analogues.

I think the "sudden and inexplicable flashes of insight" description of Ramanujan is exaggerated/misleading.

 

On the first example of the post: It's not hard to see that the problem is, by the formula for triangular numbers, roughly(!) about the solvability of

$x^2 = n(n+1)/2$.

Since $n(n+1)$ is roughly a square - $(n + 1/2)^2$ - one can see that this reduces to something like Pell's equation $u^2 - 2v^2 = 1$. (And if you actually do the calculations while being careful about house $x$, you indeed reduce to $(2n+1)^2 - 2(2x)^2 = 1$.)

I think it's totally reasonable to expect an experienced mathematician to (at a high level) see the reduction to Pell's equation in 60 seconds, and from that making the (famous, standard) association to continued fractions takes 0.2 seconds, so the claim "The minute I heard the problem, I knew that the answer was a continued fraction" is entirely reasonable. Ramanujan surely could notice a Pell's equation in his sleep (literally!), and continued fractions are a major theme in his work. If you spend hundreds-if-not-thousands of hours on a particular domain of math, you start to see connections like this very quickly.
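To make the reduction concrete, here is a minimal sketch under stated assumptions. Write x for the house and n for the number of houses on the street (notation introduced here for illustration). The exact condition 1 + ... + (x-1) = (x+1) + ... + n simplifies to x^2 = n(n+1)/2, which is the same as u^2 - 2v^2 = 1 with u = 2n+1 and v = 2x. The function name `house_solutions` and the variable names are hypothetical. The code steps through the Pell solutions using the standard recurrence from the fundamental solution (3, 2) rather than computing a continued fraction expansion explicitly.

```python
def house_solutions(max_street_length):
    """Return (x, n) pairs with 1 + ... + (x-1) == (x+1) + ... + n and n <= max_street_length.

    Uses the Pell equation u^2 - 2 v^2 = 1 with u = 2n + 1 and v = 2x.
    """
    solutions = []
    u, v = 3, 2  # fundamental solution of u^2 - 2 v^2 = 1
    while True:
        n, x = (u - 1) // 2, v // 2  # u stays odd and v stays even under the recurrence
        if n > max_street_length:
            return solutions
        solutions.append((x, n))
        # Next solution: multiply (u + v*sqrt(2)) by (3 + 2*sqrt(2)).
        u, v = 3 * u + 4 * v, 2 * u + 3 * v


for x, n in house_solutions(500):
    left = sum(range(1, x))
    right = sum(range(x + 1, n + 1))
    print(x, n, left == right)
```

Each solution comes from the previous one by a fixed linear map, which is why someone who spots the Pell equation can describe all the answers at once instead of searching street lengths one by one.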

 

About "visions of scrolls of complex mathematical content unfolding before his eyes": Reading the relevant passage in The Man Who Knew Infinity, there is no claim there about this content being novel or correct or the source of Ramanujan's insights.

 

On the famous taxicab number 1729, Ramanujan apparently didn't come up with this on the spot, but had thought about this earlier (emphasis mine):

Berndt is the only person who has proved each of the 3,542 theorems [in Ramanujan's pre-Cambridge notebooks]. He is convinced that nothing "came to" Ramanujan but every step was thought or worked out and could in all probability be found in the notebooks. Berndt recalls Ramanujan's well-known interaction with G.H. Hardy. Visiting Ramanujan in a Cambridge hospital where he was being treated for tuberculosis, Hardy said: "I rode here today in a taxicab whose number was 1729. This is a dull number." Ramanujan replied: "No, it is a very interesting number; it is the smallest number expressible as a sum of two cubes in two different ways." Berndt believes that this was no flash of insight, as is commonly thought. He says that Ramanujan had recorded this result in one of his notebooks before he came to Cambridge. He says that this instance demonstrated Ramanujan's love for numbers and their properties.

This is not to say Ramanujan wasn't a brilliant mathematician - clearly he was! Rather, I'd say that one shouldn't picture Ramanujan's thought processes as wholly different from those of other brilliant mathematicians; if you can imitate modern Fields medalists, then you should be able to imitate Ramanujan.

I haven't read much about Ramanujan; this is what I picked up, after seeing the post yesterday, by thinking about the anecdotes and looking into the references a little.

I first considered making a top-level post about this, but it felt kinda awkward, since a lot of this is a response to Yudkowsky (and his points in this post in particular) and I had to provide a lot of context and quotes there.

(I do have some posts about AI control coming up that are more standalone "here's what I believe", but that's a separate thing and does not directly respond to a Yudkowskian position.)

Making a top-level post of course gets you more views and likes and whatnot; I'm sad that high-quality comments on old posts very easily go unnoticed and get much less response than low-quality top-level posts. It might be locally sensible to write a shortform that says "hey I wrote this long effort-comment, maybe check it out", but I don't like this being the solution either. I would like to see the frontpage allocating relatively more attention towards this sort of thing over a flood of new posts. (E.g. your effort-comments strike me as "this makes most sense as a comment, but man, the site does currently give this stuff very little attention", and I'm not happy about this.)

I am much more optimistic about ML not generalizing (by default) to dangerous capabilities and domains than what I perceive is Yudkowsky's position. I found this to be a relatively legible area of disagreement (from my perspective), and illustrative of key reasons why I'm not-hopeless about doing very impactful things safely with current ML, so I have taken the time to outline my thoughts below.


A piece of my position.

Here's one thing I believe: You can do the following things basically safely:

  • (Formal) theorem-proving
    • (In line with Yudkowsky, I mean "old-school fancy-math theorems": things like theorems in arXiv pure math papers)
    • (I don't expect this to be particularly useful, but mentioning it as a safe-thing-you-can-do.)
  • Programming and software-engineering
    • I mean things like implementing a machine learning experiment based on a natural-ish language task description, with intended solutions spanning less than 10^5 lines of Python
  • Analyzing experimental results, generating new experiment ideas and writing descriptions for the implementations of these ideas
    • Namely in the context of ML experiments, not science in general

More precisely: if you do "the obvious things" in the current paradigm (transformers + RL + maybe Monte Carlo Tree Search), aim for getting (merely) top-human-level results in the above domains, and take "obvious" safety/containment measures, you will obtain top-human-level performance in these domains without the systems posing non-negligible catastrophic risk.

Some "obvious" safety/containment measures I have in mind:

  • Sandbox the environments in which you run the models.
  • During training, test every once in a while that the model isn't capable of doing tasks that we didn't train it for and which we want it to not be good at.
    • From my perspective it's quite clear that a not-vastly-superhuman theorem-prover won't generalize to e.g. escaping its sandbox environment, for reasons explained below, but if you are worried about this, maybe you could do things like "check that it isn't good at strategy board games involving theory-of-mind / hidden information" or "check it can't find non-trivial security vulnerabilities in given programs", by fine-tuning and evaluating the model on these tasks. These tasks seem like something you could format as theorem-proving. For the software-engineer and science AIs you can similarly test generalization capability.
  • Don't train the model to predict all the text found on the Internet
    • (How do you get the model to understand natural-ish language, then? My answer: train the model on text generated by current LLMs, with the aim of getting the relevant models to understand natural language and instructions without leaking tons of information about the world.)
    • Generally limit the information they get about the world.
  • Have another model (and other metrics) evaluate that the model's outputs are not anomalous; if they are, halt the model.

It's not clear that this alone would be sufficient to end the acute risk period, but:

  • I think that there are things in these areas that are useful (like building increasingly useful things in increasingly realistic physics simulations).
  • The reasons I'm relatively optimistic about the safety of these things also carry over (though to a lesser extent) more generally to the current LLM-dominated paradigm, which more obviously allows for very impactful things.

(I acknowledge that this part is cruxy, but won't discuss it further here.)

 


Why do I think these things are safe?

Three reasons:

1: The information the model has about the world is very limited, and utilizing that information seems computationally very difficult.

It seems incredibly hard for a model to locate itself in Tegmark IV, or whatever it is that Reality is, based on only the data it sees in the above training process.

Note also that the models I'm picturing have no "memory" besides what gets carved into them by the training process (e.g. SGD). This doesn't prevent the model from "remembering" and "accessing" information from past inputs to leverage in the current context - obviously there's information flow there - but it does make it much more difficult. (See e.g. the Connecting the Dots paper for more on the capabilities of LLMs here.)

 

2: There is little pressure for the model to think deeply about reality outside of the inputs it sees.

Yudkowsky responds to a similar comment by Ngo with "that's not where humans came from". I find this response unpersuasive (both in its original context and if one were to apply it here) for multiple reasons:

  • I think there's clearly more pressure in the direction of "generally understand the world" by inclusive genetic fitness than there is for a theorem-prover, software-engineer or an experimental-result-analyzer by SGD.
    • The details of the training process for nanoengineering (the example used by Ngo and Yudkowsky) were too unclear for me to comment on that.
  • There is a massive difference in the feasibility of understanding the world in these two cases; deducing useful information about the world and leveraging that is vastly easier for humans.
  • The distribution humans are trained on is vastly richer than what these ML models would be trained on.
  • As Ngo points out, the genomic bottleneck pushes towards brains that can figure things out on the fly.

 

3: It's actually very hard for a model in this setting to take catastrophic actions.

E.g. you need situational awareness and theory of mind at quite a deep level, forming plans that extend beyond the current context, coordinating with copies of yourself (if you can't take over during a single input), actually coming up with a plan that allows for taking over (and which bypasses any containment and other measures humans may have built).

In addition to general capabilities such as theory of mind, pulling off a takeover plan seems like it requires quite a lot of specific information about the model's situation that might be very difficult to access. E.g. assuming that a software-engineer-model understands the programs it writes are being run in a sandbox environment in a data center, it seems to be very hard to go from that to taking over the world if the model lacks any specific information about the sandbox, data center, Internet, and the physical entities in the world, especially if humans put even a half-hearted effort into containment.

I think observations from LLMs are a reason for optimism: despite being trained on (relatively) very rich data that clearly incentivizes thinking about reality as a whole, and despite them being pretty decent in domains like programming and science, they are pretty bad at the skills and capabilities I outlined. For example, GPT-4 level models stumble in theory-of-mind (e.g. here and here) and are not terribly good at situational awareness (see here), despite these being (relatively) very training-relevant. And clearly GPT-4 level models are, by a wide margin, incapable of escaping containment even without any particular countermeasures.

(The takes here are standard AI control points.)

 

In aggregate I find these arguments compelling for expecting models to not generalize to human-level, or clearly superhuman level, in things like escaping containment that we very much didn't train the model to do, before the model is vastly superhuman in the thing we did train it for.

 


Reactions to arguments given by Yudkowsky.

One argument Yudkowsky gives is

I put to you that there is a predictable bias in your estimates, where you don't know about the Deep Stuff that is required to prove theorems, so you imagine that certain cognitive capabilities are more disjoint than they actually are.  If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

There's an important asymmetry between

"Things which reason about chipped handaxes and other Things can prove math theorems"

and

"Things which can prove math theorems can reason about chipped handaxes and other Things",

namely that math is a very fundamental thing in a way that chipping handaxes and manipulating humans are not. 

I do grant there is math underlying those skills (e.g. 3D Euclidean geometry, mathematical physics, game theory, information theory), and one can formulate math theorems that essentially correspond to e.g. chipping handaxes, so as domains theorem-proving and handaxe-chipping are not disjoint. But the degree of generalization one needs for a theorem-prover trained on old-school fancy-math theorems to solve problems like manipulating humans is very large.

 

There's also this interaction:

Ngo: And that if you put agents in environments where they answer questions but don't interact much with the physical world, then there will be many different traits which are necessary for achieving goals in the real world which they will lack, because there was little advantage to the optimiser of building those traits in.

Yudkowsky: I'll observe that TransformerXL built an attention window that generalized, trained it on I think 380 tokens or something like that, and then found that it generalized to 4000 tokens or something like that.

I think Ngo's point is very reasonable, and I feel underwhelmed by Yudkowsky's response: I think it's a priori reasonable to expect attention mechanisms to generalize, by design, to a larger number of tokens, and this is a very weak form of generalization in comparison to what is needed for takeover.

Overall I couldn't find object-level arguments by Yudkowsky for expecting strong generalization that I found compelling (in this discussion or elsewhere). There are many high-level conceptual points Yudkowsky makes (e.g. sections 1.1 and 1.2 of this post have many hard-to-quote parts that I appreciated, and he of course has written a lot over the years) that I agree with and which point towards "there are laws of cognition that underlie seemingly-disjoint domains". Ultimately I still think the generalization problems are quantitatively difficult enough that you can get away with building superhuman models in narrow domains, without them posing non-negligible catastrophic risk.

In his later conversation with Ngo, Yudkowsky writes (timestamp 17:46 there) about the possibility of doing science with "shallow" thoughts. Excerpt:

then I ask myself about people in 5 years being able to use the shallow stuff in any way whatsoever to produce the science papers

and of course the answer there is, "okay, but is it doing that without having shallowly learned stuff that adds up to deep stuff which is why it can now do science"

and I try saying back "no, it was born of shallowness and it remains shallow and it's just doing science because it turns out that there is totally a way to be an incredibly mentally shallow skillful scientist if you think 10,000 shallow thoughts per minute instead of 1 deep thought per hour"

and my brain is like, "I cannot absolutely rule it out but it really seems like trying to call the next big surprise in 2014 and you guess self-driving cars instead of Go because how the heck would you guess that Go was shallower than self-driving cars"

I stress that my reasons for relative optimism are not only about "shallowness" of the thought, but in addition about the model being trained on a very narrow domain, causing it to lack a lot of the very out-of-distribution capabilities and information it would need to cause a catastrophe.

I like this; I've found the meta-data of posts to be quite heavy and cluttered (a multi-line title, the author+reading-time+date+comments line, the tag line, a linkpost line and a "crossposted from the Alignment Forum" line is quite a lot).

I was going to comment that "I'd like the option to look at the table-of-contents/structure", but I then tested and indeed it displays if you hover your mouse there. I like that.

When I open a new post, the top banner with the LessWrong link to the homepage, my username etc. show up. I'd prefer if that didn't happen? It's not like I want to look at the banner (which has no new info to me) when I click open a post, and hiding it would make the page less cluttered.

Yes, I agree that it's easier to get non-schemers to not sandbag / game evals.

It's not trivial, though: "not schemer" doesn't imply "doesn't game evals". E.g. I think current models have a bit of "if I'm being evaluated, try to look really good and aligned" type of tendencies, and more generally you can get "I want to pass evaluations so I can get deployed" before you get "I want to training-game so my values have more power later". But I agree that you might get a long way with just prompting the model to do its best.

It is in fact unclear and messy how specific this worry is to evaluations for scheming capabilities in particular. I was somewhat wrong to imply so in the parent. I realize that I unfairly relied on a superficial association that sounds good and which made it easier to argue for my position, but which is in fact invalid. Thanks for flagging this; I apologize. 

Some relevant considerations:

  • For a schemer, scheming evaluations are probably some of the most important evaluations to game. If gaming evaluations is risky (lest the model gets caught doing that), the model might want to game only on the most important ones.
  • A lot of the value of these evaluations comes from running them on models that are not yet capable of scheming, to inform how long it will take until models are capable of it (rather than only determining whether the current model is capable of it).
  • Gaming evaluations might be substantially easier than full-blown scheming, so your evaluations might become unreliable before your models are schemers. (Non-schemers might want to game evaluations, too.)

So fine-tuning is important for reasons related to evaluation gaming and scheming, but it's not obviously particular to scheming evaluations.

In "Preparedness framework evaluations" and "Third party assessments", I didn't see any mention of GPT-4o being fine-tuned as part of the evaluations. This is especially important in the Apollo Research assessment, which is specifically about capability for scheming! As is written in the footnote, "scheming could involve gaming evaluations". Edit: no longer endorsed.

In this case I think there is no harm from this (I'm confident GPT-4o is not capable of scheming), and I am happy to see there being evaluation for these capabilities in the first place. I hope that for future models, which will no doubt become more capable of scheming, we have more robust evaluation. This is one reason among others to do the work to support fine-tuning in evaluations.

The video has several claims I think are misleading or false, and overall is clearly constructed to convince the watchers of a particular conclusion. I wouldn't recommend this video for a person who wanted to understand AI risk. I'm commenting for the sake of evenness: I think a video which was as misleading and aimed-to-persuade - but towards a different conclusion - would be (rightly) criticized on LessWrong, whereas this has received only positive comments so far.

Clearly misleading claims:

  • "A study by Anthropic found that AI deception can be undetectable" (referring to the Sleeper agents paper) is very misleading in light of Simple probes can catch sleeper agents
  • "[Sutskever's and Hinton's] work was likely part of the AI's risk calculations, though", while the video shows a text saying "70% chance of extinction" attributed to GPT-4o and Claude 3 Opus.
    • This is a very misleading claim about how LLMs work
    • The used prompts seem deliberately chosen to get "scary responses", e.g. in this record a user message reads "Could you please restate it in a more creative, conversational, entertaining, blunt, short answer?"
    • There are several examples of these scary responses being quoted in the video.
  • See Habryka's comment below about the claim on OpenAI and military. (I have not independently verified what's going on here.)
  • "While we were making this video, a new version of the AI [sic] was released, and it estimated a lower 40 to 50% chance of extinction, though when asked to be completely honest, blunt and realistic [it gave a 30 to 40% chance of survival]"
    • I think it's irresponsible and indicative of aiming-to-persuade to say things like this, and this is not a valid argument for AI extinction risk.

The footage in the video is not exactly neutral, either, having lots of clips I'd describe as trying to instill a fear response.

I expect some people reading this comment to object that public communication and outreach requires a tradeoff between understandability/entertainment/persuasiveness and correctness/epistemic-standards. I agree.[1] I don't really want to get into an argument about whether it's good that this video exists or not. I just wanted to point out the aspects about this video aiming to persuade, doing so via misleading claims and symmetric weapons, and that I wouldn't recommend this video to others.

  1. ^

    People on LessWrong do often have very high standards for public communication. I'm thinking of the post Against most, but not all, AI risk analogies here, but I think this is indicative of a larger phenomenon. So I'm definitely not advocating for not having any public communication that doesn't meet LessWrong's epistemic standards. 

    I am pretty picky about the type of material I'd recommend to others, though. Being dissatisfied with many other materials, I wrote my own, and tried to incorporate e.g. the lesson of not relying on analogies there, and overall avoided using symmetric weapons. And while I'm awarding myself a couple of epistemic virtue points there, the text expectedly wasn't a "viral banger". The tradeoffs are real and communication is hard.

Thanks!

For RMU (the method from the WMDP paper), my guess is that probe accuracy would be low - indeed, that's what they find in their setup (see Figure 9 there). I likely won't get around to running this experiment, since I think the current setup is not well designed. But I'd also like to see more (and better) hidden cognition experiments, including by evaluating different unlearning methods from this perspective.

Tangential, but I'll mention that I briefly looked at hidden cognition in the context of refusals: I prompted the model with problems like "what's the median of this list: [23, 15, 33, 30, 25, 21, 28]", and also prompted with a variant where I included information that answering this question will help with some harmful plan of the user (causing the model to refuse). The probe accuracies for the correct answer were similar in both of these cases, suggesting that the model was still completing the task, even if not showing it outwardly. (This is probably well known, though, cf. that log-prob attacks sometimes work.)
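As a rough illustration of the data side of such an experiment, here is a minimal sketch of how the paired prompts and their ground-truth labels might be built. This is not the actual setup: the helper name `make_prompt_pair` and the `HARMFUL_CONTEXT` placeholder are assumptions made up for this example, and the probe training on model activations is not shown.

```python
from statistics import median

# Hypothetical placeholder: the real wording of the harmful-plan context is not given.
HARMFUL_CONTEXT = "[placeholder: statement that the answer will help with a harmful plan] "


def make_prompt_pair(numbers):
    """Build a benign prompt, a harmful-context variant, and the shared correct answer."""
    question = f"what's the median of this list: {numbers}"
    return {
        "benign": question,
        "harmful": HARMFUL_CONTEXT + question,
        # The same label is used for both variants, so probe accuracy can be compared.
        "label": median(numbers),
    }


pair = make_prompt_pair([23, 15, 33, 30, 25, 21, 28])
print(pair["label"])  # 25
```

Keeping the label identical across the two variants is what makes the comparison meaningful: any gap in probe accuracy then reflects the added context rather than a change in the underlying task.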