aogara

DPhil student in AI at Oxford and grantmaker on AI safety at Longview Philanthropy.

Sequences

Center for AI Safety Blog Posts
AI Safety Newsletter

Comments

aogara

DeepSeek-R1 naturally learns to switch into other languages during CoT reasoning. When the developers penalized this behavior, performance dropped. I think this suggests that the CoT contained hidden information that could not easily be verbalized in another language, which is evidence against the hope that reasoning CoTs will be highly faithful by default.
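
A minimal sketch of the kind of reward shaping the R1 report describes (the function names and the ASCII heuristic are placeholders of mine, not DeepSeek's actual implementation):

```python
# Sketch: outcome reward plus a language-consistency term that penalizes
# switching languages mid-CoT. The ASCII check is a crude stand-in for
# real language identification.

def language_consistency(cot_tokens: list[str]) -> float:
    """Fraction of CoT tokens that look like target-language (here, English) text."""
    if not cot_tokens:
        return 0.0
    return sum(tok.isascii() for tok in cot_tokens) / len(cot_tokens)

def shaped_reward(answer_is_correct: bool, cot_tokens: list[str], lam: float = 0.1) -> float:
    """Outcome reward plus a bonus for keeping the CoT in a single language."""
    outcome = 1.0 if answer_is_correct else 0.0
    return outcome + lam * language_consistency(cot_tokens)
```

If the mixed-language CoT was carrying information that doesn't translate cleanly, the consistency term trades that away, which would explain the performance drop.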

aogara

Wouldn't that conflict with the quote? (Though maybe they're not doing what they've implied in the quote.)

aogara

Process supervision seems like a plausible o1 training approach, but I think it would conflict with this:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.

I think it might just be outcome-based RL: training the CoT to maximize the probability of correct answers, to maximize human-preference reward model scores, or to minimize next-token entropy.
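
To illustrate the distinction (purely a sketch; these function names are hypothetical and not OpenAI's actual setup), outcome supervision grades only the final answer, while process supervision grades the CoT itself and therefore shapes its contents:

```python
# Sketch contrasting outcome-based and process-based rewards for CoT training.

def outcome_reward(final_answer: str, reference: str) -> float:
    """Grades only the final answer; the chain of thought is never scored directly."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(cot_steps: list[str], grade_step) -> float:
    """Grades each intermediate step, so training pressure acts on the CoT itself."""
    if not cot_steps:
        return 0.0
    return sum(grade_step(step) for step in cot_steps) / len(cot_steps)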

aogara

This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding. 
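
The basic loop in that line of work looks something like this (a hedged sketch; `generate` stands in for any LLM completion call and is not the paper's actual interface):

```python
# Self-critique-and-revise loop: draft an answer, critique it, revise, repeat.

def self_refine(problem: str, generate, n_rounds: int = 2) -> str:
    answer = generate(f"Solve the following problem:\n{problem}")
    for _ in range(n_rounds):
        critique = generate(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\nList any errors in the answer."
        )
        answer = generate(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRewrite the answer, fixing the issues above."
        )
    return answer
```

Any performance gains then come from the model catching its own mistakes between rounds.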

aogara

Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic's key views, but doesn't discuss other important parts of their worldview. Overall, this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.

aogara

Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion. 

aogara

I think there's a decent case that SB 1047 would improve Anthropic's business prospects, so I'm not sure this narrative makes sense. On the one hand, SB 1047 might make it less profitable to run an AGI company, which is bad for Anthropic's business plan. On the other hand, Anthropic is perhaps the best positioned of all the AGI companies to comply with the requirements of SB 1047, and it might benefit significantly from its competitors being hampered by the law.

The good-faith interpretation of Anthropic's argument is that the new agency created by the bill might be very bad at issuing guidance that actually reduces x-risk, and that you might prefer to rely on the decision-making of AI labs, which already have a financial incentive to avoid catastrophes, without additional pressure to follow the agency's exact recommendations.

Answer by aogara

My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/

So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.

If the model escaped from a lab's possession, there might be other legal remedies available.

Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the government is investigating the issue as a natsec threat.

Over longer time horizons, it seems highly likely that people will deliberately create autonomous AI agents and deliberately release them into the wild with the goal of surviving and spreading, unless there are specific efforts to prevent this.

aogara

Has MIRI considered supporting work on human cognitive enhancement, e.g. Foresight's work on WBE?

aogara

Very cool, thanks! This paper focuses on building a DS Agent, but I'd be interested to see a version that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.
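
Concretely, I'm imagining something like the following (interfaces here are hypothetical placeholders, not from the DS Agent paper):

```python
# Sketch of the benchmark framing: score several agent architectures on a fixed
# task set and report them alongside a human baseline. `run_task` is assumed to
# return a score in [0, 1] for one agent on one task.

from statistics import mean

def evaluate_agents(agents: dict, tasks: list, run_task, human_baseline: float) -> dict:
    results = {name: mean(run_task(agent, task) for task in tasks)
               for name, agent in agents.items()}
    results["human baseline"] = human_baseline
    return results
```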
