IKumar10

Or a model could directly reason about which new values would best systematize its current values, with the intention of having its conclusions distilled into its weights; this would be an example of gradient hacking.

Quick clarifying question: figuring out the direction in weight space in which an update should be applied in order to modify a neural net's values seems like it would require a very strong grasp of mechanistic interpretability, far beyond current human levels. Is this an underlying assumption for a model that is able to direct how its values will be systematised?
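To gesture at what I mean by a 'direction in weight space': the update direction towards some alternative set of values can be framed as the gradient of a score measuring those values, which an optimiser computes mechanically via backprop, but which a model steering its own training would have to predict from its behaviour alone. Here's a minimal PyTorch sketch; the toy model, `value_probe_loss`, and the target are hypothetical stand-ins I've made up for illustration, not anything from the post.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: frame "the direction in weight space that shifts a
# model's values" as the gradient of a scalar score measuring how strongly
# the model expresses some target set of values.

model = nn.Linear(16, 4)  # stand-in for a much larger network


def value_probe_loss(model: nn.Module, prompts: torch.Tensor) -> torch.Tensor:
    # Illustrative placeholder: score how far the model's outputs are from
    # outputs consistent with the "systematised" values.
    target = torch.zeros(prompts.shape[0], 4)
    return ((model(prompts) - target) ** 2).mean()


prompts = torch.randn(8, 16)
loss = value_probe_loss(model, prompts)
loss.backward()

# The gradients below are (locally) the weight-space direction that moves the
# model towards the target values. The optimiser gets this "for free"; a model
# trying to steer how its conclusions are distilled into its weights would
# instead have to anticipate this direction without running backprop itself,
# which is where very strong (self-)interpretability seems to be needed.
direction = {name: p.grad.clone() for name, p in model.named_parameters()}
```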

IKumar10

I like the style of this post, thanks for writing it! Some thoughts:

model scaling stops working

Roughly what probability would you put on this? I see it as really unlikely (perhaps <5%), to the point that 'scaling stops working' isn't part of my model for the next 1-2 years.
 

I will be slightly surprised if by end of 2024 there are AI agents running around the internet that are meaningfully in control of their own existence, e.g., are renting their own cloud compute without a human being involved.

Only slightly surprised? IMO being able to autonomously rent cloud compute seems quite significant (technically and legally), and I'd be very surprised if something like this happened on a 1-year horizon. I'd also be negatively surprised if the US government didn't institute regulation on the operation of autonomous agents of this type by the end of 2024, given their potential for misuse and their economic value. It may help to know how you're operationalising AIs that are 'meaningfully in control of their own existence'.

IKumar110

Policy makers do not know this. They know that someone is telling them this. They definitely do not know that they will get the economic promises of AGI on the timescales they care about, if they support this particular project. 

I feel differently here. It seems that a lot of governments have woken up to AI in the past few years and are putting it at the forefront of their national strategies (e.g. see the headline here). There has been a lot of movement in the regulatory space over the past year, but I'm still getting undertones of 'we realise that AI is going to be huge, and we want to establish global leadership in this technology'.

So, going back to your nuclear example, I think the relevant question is: 'What allowed policymakers to gain the necessary support to push stringent nuclear regulation through, even though nuclear power offered huge economic benefits?' I think there are two things:

  1. It takes a significant amount of time, ~6-8 years, for a nuclear power plant to be built and begin operating (and even longer for it to break even). So whilst nuclear plants are economically practical in the long term, it can be hard to garner support for the huge initial investment. To make this clearer, imagine it took ~1 year to build a nuclear power plant and 2 years for it to break even; if that were the case, I think it would have been harder to push stringent regulation through.
  2. There was a lot of irrational public fear of anything nuclear, due to power plant accidents, the Cold War, and memories of nuclear weapon use during WWII.

 

With respect to AI, I don't think (1) holds. That is, the economic benefits of AI will be far easier to realise than those of nuclear power (you can train and deploy an AI system within a year, and likely break even a few years after that), meaning policymaker support for regulation will be harder to secure.

(2) might hold; this really depends on the nature of AI accidents over the next few years and their impact on public perception. I'd be interested in your thoughts here.

IKumar40

I'm not sure about how costly these sorts of proposals are (e.g. because it makes customers think you're crazy). Possibly, labs could coordinate to release things like this simultaneously to avoid tragedy of the commons (there might be anti-trust issues with this).

Yep, buy-in from the majority of frontier labs seems pretty important here. If OpenAI went out and said 'We think there's a 10% chance that the AGI we develop kills over 1 billion people', but Meta kept their current stance (along the lines of 'we think the AI x-risk discussion is fearmongering, and the systems we're building will be broadly beneficial for humanity'), then I'd guess that OpenAI would lose a ton of business. From the point of view of an enterprise using OpenAI's products, it can't help its public image to be using the products of a lab that thinks it has a 10% chance of ending the world, especially if other labs offer similar products that don't carry this burden. In a worst-case scenario, I can imagine this putting OpenAI directly in the firing line of regulators, whilst Meta gets off far more lightly.