Jay Bailey

Comments

""AI alignment" has the application, the agenda, less charitably the activism, right in the name."

This seems like a feature, not a bug. "AI alignment" is not a neutral idea. We're not just researching how these models behave, or how minds might be built, out of pure scientific curiosity. The field has a specific purpose in mind - to align AIs. Why would we not want that agenda to be part of the name?

What are the best ones you've got?

I don't think this is a good metric. It is very plausible that porn is net bad, but that living under the type of government that would outlaw it is worse. In which case your best bet would be to support its legality but avoid it yourself.

I'm not saying that IS the case, but it certainly could be. I definitely think there are plenty of things that are net-negative to society but nowhere near bad enough to outlaw.

An AGI that can accurately answer questions such as "What would this agentic AGI do in this situation?" will, if powerful enough, learn what agency is by default, since understanding agency is useful for making such predictions. So you can't just train an AGI with little agency. You would need to do one of:

  • Train the AGI with the capabilities of agency, and train it not to use them for anything other than answering questions.
  • Train the AGI such that it did not develop agency despite being pushed by gradient descent to do so, and accept the loss in performance.

Both of these seem like difficult problems. Solving either (especially the first) would be very useful, but the first in particular already seems like a big part of the alignment problem.

Late response, but I figure people will continue to read these posts over time: Wedding-cake multiplication is the way they teach multiplication in elementary school. That is, to multiply 706 x 265, you do 706 x 5, then 706 x 60, then 706 x 200, and add all the results together. I imagine it is called that because the result is tiered like a wedding cake.
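The digit-by-digit decomposition above can be sketched in a few lines (the decomposition into place-value partial products is from the comment; the function name is mine):

```javascript
// Multiply a by b one decimal digit of b at a time,
// summing the place-value partial products.
function weddingCakeMultiply(a, b) {
  let total = 0;
  let place = 1; // 1, 10, 100, ... for each digit of b
  while (b > 0) {
    total += a * (b % 10) * place; // e.g. 706 * 5, then 706 * 60, then 706 * 200
    b = Math.floor(b / 10);
    place *= 10;
  }
  return total;
}

console.log(weddingCakeMultiply(706, 265)); // 187090, same as 706 * 265
```

The three partial products are 3,530 + 42,360 + 141,200 = 187,090 - the "tiers" of the cake.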

One of the easiest ways to automate this is to set a hard threshold that things are not allowed to grow past - one that is immediately obvious, and ideally has some physical or digital enforcement mechanism attached.

Examples:

  • Set up a Chrome extension that doesn't let you have more than 10 tabs at a time. (I did this.)
  • Have some fixed number of drawers / amount of closet space. If your clothes cannot fit into this space, you're not allowed to keep them. If you buy something new, something else has to come out.
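The tab-limit extension could be sketched roughly like this (a sketch under my own assumptions, not the actual extension I used: the helper name, the exact limit check, and the file layout are mine, while `chrome.tabs.onCreated`, `chrome.tabs.query`, and `chrome.tabs.remove` are the real Chrome extension APIs):

```javascript
// background.js for a Manifest V3 extension
// (manifest.json would declare: "background": {"service_worker": "background.js"}).
const MAX_TABS = 10;

// Pure helper so the decision logic is testable outside the browser.
function overLimit(tabCount, limit) {
  return tabCount > limit;
}

// Only wire up the listener when actually running inside an extension.
if (typeof chrome !== "undefined" && chrome.tabs) {
  chrome.tabs.onCreated.addListener((tab) => {
    chrome.tabs.query({}, (tabs) => {
      // The query result already includes the newly created tab.
      if (overLimit(tabs.length, MAX_TABS)) {
        chrome.tabs.remove(tab.id); // refuse to keep the 11th tab
      }
    });
  });
}
```

Closing the offending tab immediately is the "digital prevention mechanism": the threshold enforces itself rather than relying on willpower.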

I know this is two years later, but I just wanted to say thank you for this comment. It is clear, correct, and well-written, and if I had seen this comment when it was written, it could have saved me a lot of problems at the time.

I've now resolved this issue to my satisfaction, but once bitten, twice shy, so I'll try to remember this if it happens again!

Sorry it took me a while to get to this.

Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being a condition imposed on you so that you can do what you actually want. For example, if your goal is to understand the importance of security mindset in order to keep your company from suffering security breaches, you will learn much more than if you are forced to go through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and get back to whatever your actual job is. You are unlikely to learn security this way, and if you could press a button and instantly "pass" the course, you would.

I have in fact made a divide between some things and some other things in my above post. I suppose I would call those things "goals" (the things you really want for their own sake) and "conditions" (the things you need to do for some external reason).

My inner MIRI says - we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.

I think what the OP was saying is that in, say, 2013, there's no way we could have predicted the type of agent that LLMs are, or that they would be the most powerful AIs available. So nobody was asking "What if we get to the 2020s and it turns out all the powerful AIs are LLMs?" back then. Therefore, this raises a question about the value of the alignment work done before then.

If we extend that to the future, we would expect most good alignment research to happen within a few years of AGI, once it becomes clear what type of agent we're going to get. Alignment research is much harder if, ten years from now, the thing that becomes AGI is as unexpected to us as LLMs were ten years ago.

Thus, goes the argument, there's not really that much difference between getting AGI in five years with LLMs or in fifteen years with God only knows what, since it's the last few years that matter.

A hardware overhang, on the other hand, would be bad. Imagine we had 2030s hardware when LLMs came onto the scene. You'd have Vaswani et al. coming out in 2032, and by 2035 you'd have GPT-8. That would be terrible.

Therefore, says the steelman, the best scenario is if we are currently in a slow takeoff that gives us time. The hardware overhang will never again be lower than it is now, which ensures we are bumping up not only against conceptual understanding and engineering requirements but also against the fundamental limits of compute, which caps how fast we can scale the LLM paradigm. This may not hold if we get a new type of agent in ten years.
