Knight Lee


This is a very beautiful idea! It feels like the kind of clever discovery that we need.

I think one possible generalization of MONA is this: a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. The stronger model's RL reward is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, the stronger model can't get away with acting on them. It won't be rewarded for learning such reasons in the first place.

Just as in scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses them. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture, or whatever other reason we trust it more).
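
A minimal sketch of what this training loop could look like (all function and object names here are hypothetical, just to make the reward structure concrete):

```python
# Minimal sketch of the proposed setup (all names hypothetical). The untrusted
# "advisor" is trained with RL purely on the trusted "decider" model's approval
# of its advice; the task outcome never enters the advisor's reward.

def collect_step(task, advisor, decider):
    obs = task.observation()
    advice = advisor.generate(obs)               # stronger, untrusted model proposes ideas
    approval = decider.rate_advice(obs, advice)  # weaker, trusted model scores the advice
    action = decider.act(obs, advice)            # weaker, trusted model makes the actual decision
    outcome = task.step(action)                  # environment feedback
    return approval, outcome

# The advisor's policy gradient uses only `approval`. The `outcome` may be used
# to improve the decider through some safer process (e.g. supervised updates or
# HCH-style amplification), but it is never backpropagated into the advisor.
```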

Do you think this is a next step worth exploring?

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

Maybe one concrete implementation would be: when doing RL[1] on an AI like o3, they don't give it a single math question to solve. Instead, they give it something like 5 quite different tasks, and the AI has to allocate its time across the 5 tasks.

I know this sounds like a small boring idea, but it might actually help if you really think about it! It might cause the resulting agent's default behaviour pattern to be "optimize multiple tasks at once" rather than "optimize a single task ignoring everything else." It might be the key piece of RL behind the behaviour of "whoa I already optimized this goal very thoroughly, it's time I start caring about something else," and this might actually be the behaviour that saves humanity.
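
To make the idea concrete, here is a toy sketch (all task names and numbers made up) of an episode where per-task reward has diminishing returns, so the agent is pushed to spread its time budget rather than over-optimize a single task:

```python
import random

# Toy sketch (illustrative only): one RL episode contains several different
# tasks sharing a single time budget that the agent must allocate.
TASKS = ["math_proof", "code_bug", "essay_edit", "data_cleanup", "web_research"]
TIME_BUDGET = 100  # abstract "steps" per episode

def episode_reward(allocation, difficulty):
    """Per-task reward has diminishing returns, so dumping the whole budget
    on one task scores worse than spreading effort across tasks."""
    assert sum(allocation.values()) <= TIME_BUDGET
    return sum(allocation[t] / (allocation[t] + difficulty[t]) for t in TASKS)

difficulty = {t: random.uniform(10, 40) for t in TASKS}
even_split = {t: TIME_BUDGET // len(TASKS) for t in TASKS}
all_on_one = {t: TIME_BUDGET if t == "math_proof" else 0 for t in TASKS}

print("spread effort: ", episode_reward(even_split, difficulty))
print("single-minded: ", episode_reward(all_on_one, difficulty))
```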

  1. ^

    RL = reinforcement learning

Question: what fraction of work should prioritize the gradual disempowerment risk, and what fraction of work should prioritize the treacherous turn risk? (Guesstimate)

Question 2: what is your response to this argument?

  • The main driving force of gradual disempowerment seems to be "societal inevitability," and "millions of people seeing the problem in front of their eyes but being unable to convince society to take action."

    If that is the main problem, shouldn't you expect this problem to be even worse right now? Right now the AI safety community is extremely tiny ($0.1 billion/year, or roughly 0.0001% of the world economy), and the problem appears even harder to seriously believe now than it will be later. It is also harder to find solutions before seeing the problem.

    One analogy is a tiny group of people in 1990, who could foresee democratic backsliding and fake news happening around 2020. Assuming they have $0.1 billion/year and a mediocre reputation, what is the probability they can use their earliness to fix these problems (given that people in 2020 were unable to fix them)?

Although I thought of this argument, I don't think it's necessarily correct and my intuition about it is very fuzzy and uncertain. I just want to hear your response.

That's a good idea! Even today it may be useful for export controls (depending on how reliable it can be made).

The most powerful chips might be banned from export, and have "verified boot" technology inside in case they are smuggled out.

The second most powerful chips might be exported only to trusted countries, and also have this verified boot technology in case those trusted countries end up selling them to less trusted countries, who sell them on yet again.

I do think that convincing the government to pause AI in a way which sacrifices $3000 billion of economic value is relatively easier than directly spending $3000 billion on AI safety.

Maybe spending $1 is about as hard as sacrificing $10-$100 of future economic value via preemptive regulation.[1]

But $0.1 billion of AI safety spending is so ridiculously little (1000 times less than capabilities spending) that increasing it may still be the "easiest" thing to do. Of course we should still push for regulation at the same time (it doesn't hurt).

PS: what do you think of my open letter idea for convincing the government to increase funding?

  1. ^

    Maybe "future economic value" is too complicated. A simpler guesstimate would be "spending $1 is similarly hard to sacrificing $10 of company valuations via regulation."

I think both duration and funding are important.

I agree that increasing duration has a greater impact than increasing funding. But increasing duration is harder than increasing funding.

AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. Increasing funding by 10x is relatively more attainable, while increasing duration by 10x would require more of a miracle.

Even if you believe that funding today isn't very useful and funding in the future is more useful, increasing funding now moves the Overton window a lot. It's hard for any government which has traditionally spent only $0.01 billion to suddenly spend $100 billion. They'll use the previous budget as an anchor point to decide the new budget.

My guess is that 4x funding ≈ 2x duration.[1]

  1. ^

    For inventive steps, having twice as many "inventors" reduces the time to invention by half, while for engineering steps, having twice as many "engineers" doesn't help very much.

    (Assuming the time it takes each inventor to think of the invention is an independent, exponentially distributed random variable.)
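
    Spelling out that assumption: if each of $n$ inventors independently takes time $T_i \sim \mathrm{Exp}(\lambda)$ to hit on the invention, the time until the first success is

    $$T_{(1)} = \min(T_1, \dots, T_n) \sim \mathrm{Exp}(n\lambda), \qquad \mathbb{E}\big[T_{(1)}\big] = \frac{1}{n\lambda},$$

    so doubling the number of inventors halves the expected time to an inventive step, while a purely serial engineering step gains little from extra people.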

I would go one step further and argue that you don't need to take over territory to shut down the semiconductor supply chain: if enough large countries believed AI risk was a desperate problem, they could convince others and negotiate a shutdown of the supply chain.

Shutting down the supply chain (and thus all leading-edge semiconductor fabs) could slow the AI project for a long time, but probably not "150 years", since the uncooperative countries would eventually build their own supply chain and fabs.

Even if building intelligence requires solving many, many problems, preventing that intelligence from killing you may only require solving a single very hard problem. We may go from having no idea to having a very good idea.

I don't know. My view is that we can't be sure of these things.

Thank you, I've always been curious about this point of view because a lot of people have a similar view to yours.

I do think that alignment success is the most likely avenue, but my argument doesn't require this assumption.

Your view isn't just that "alternative paths are more likely to succeed than alignment," but that "alternative paths are so much more likely to succeed than alignment that the marginal capabilities increase caused by alignment research (or at least by Anthropic) makes them unworthwhile."

To believe that alignment is that hopeless, there should be stronger proof than "we tried it for 22 years, and the prior probability of the threshold being between 22 years and 23 years is low." That argument can easily be turned around to argue why more alignment research is equally unlikely to cause harm (and why Anthropic is unlikely to cause harm). I also think multiplying funding can multiply progress (e.g. 4x funding ≈ 2x duration).

If you really want a singleton controlling the whole world (which I don't agree with), your most plausible path would be for most people to see AI risk as a "desperate" problem, and for governments under desperation to agree on a worldwide military which swears to preserve civilian power structures within each country.[1]

Otherwise, the fact that no country took over the world during the last centuries strongly suggests that no country will in the next few years, and this feels more solid than your argument that "no one figured out alignment in the last 22 years, so no one will in the next few years."

  1. ^

    Out of curiosity, would you agree with this being the most plausible path, even if you disagree with the rest of my argument?

Hi,

I've just read this post, and the arguments Anthropic made about how the US needs to stay ahead of China are disturbing.

I hadn't caught up on this news, and now I think I know where the anti-Anthropic sentiment is coming from.

I do think that Anthropic only made those arguments in the context of GPU export controls, trying to convince the Trump administration to implement export controls if nothing else. It's still very concerning, and could undermine their ability to argue for strong regulation in the future.

That said, I don't agree with the nuclear weapon explanation.

Suppose Alice and Bob were each building a bomb. Alice's bomb has a 10% chance of exploding and killing everyone, and a 90% chance of exploding into rainbows and lollipops and curing cancer. Bob's bomb has a 10% chance of exploding and killing everyone, and a 90% chance of "never being used" and having a bunch of good effects via "game theory."

I don't think people with ordinary moral views would be very angry at Alice yet forgive Bob just because "Bob's bomb was built not to be used."
