Eli Tyre

Comments

That gives us a bit more time to figure out how to work with that societal attention as it continues to grow.

Unfortunately, as societal attention to AI ramps up, less and less of that attention will go to "us".

I'm much more doubtful than most people around here about whether CEV coheres: I guess that the CEV of some humans has them wirehead themselves and the CEV of other humans doesn't, for instance.

But I'm bracketing that concern for this discussion. Assuming CEV coheres, then yes, I predict that it will have radical (in the sense of a political radical whose beliefs are extremely outside of the Overton window, such that they are disturbing to the median voter) views about all of those things.

But more confidently, I predict that it will have radical views about a very long list of things that are commonplace in 2024, even if it turns out that I'm wrong about this specific set. 

CEV asks what we would want if we knew everything the AI knows. There are dozens of things that I think I know which, if the average person knew them to be true, would invalidate a lot of their ideology. If the average person knew everything that an AGI knows (which includes potentially millions of subjective years of human science, whole new fields, as foundational to one's worldview as economics and probability theory are to my current worldview), and they had hundreds of subjective years to internalize those facts and domains, in a social context that was conducive to that, with (potentially) large increases in their intelligence, I expect their views would be basically unrecognizable after a process like that.

As a case in point, most people consider it catastrophically bad to have their body destroyed (duh). And if you asked them whether they would prefer, given that their body was being destroyed, to have their brain-state recorded, uploaded, and run on a computer, many would say "no", because it seems horrifying to them.

Most LessWrongers embrace computationalism: they think that living as an upload is about as good as living as a squishy biological robot (and indeed, better in many respects). They would of course choose to be uploaded if their body was being destroyed. Many would elect to have their body destroyed specifically because they would prefer to be uploaded!

That is, most LessWrongers think they know something which most people don't know, but which, if they did know it, would radically alter their preferences and behavior.

I think a mature AGI knows at least thousands of things like that.

So one of the things about CEV that I'm most confident about (again, granting that it coheres at all) is that it has extremely radical views: conclusions which would be horrifying to most people, probably including myself.

I do.

I mean, it depends on the point of the exact CEV procedure. But yes.

That the people advocating for this and similar laws are statists that love regulation.

  1. Seriously, no. It is remarkable the extent to which the opposite is true.

I'm keenly aware that many of the main advocates of this and similar regulations are basically old-school, free-market, (more or less) minimal-government libertarians, who discovered the unfortunate fact that the world seems likely to be destroyed by AI development.

(I'm one of them.)

But I would guess that, in addition to those people, this bill and others like it do have supporters who are in favor of increasing the power of the state, or hampering big tech, basically for the sake of it? There are a fair number of those people around.

Right now it matters at most to the very biggest handful of labs.

That sounds right, but it's unclear to me how many companies would want to train 10^26 FLOP models in 2030. 

I think still not very many, because training a model is a big industrial process, with major economies of scale and winner-take-most effects. It's a place where specialization really makes sense. I'd guess that there will be fewer than 20 companies in the US training models of that size, and everyone else will be licensing them / using them through an API.

But the Bill apparently makes a provision for that, in that the standards for what counts as a covered model change after 2027.

This could then be built upon.

I would like to know how that process works. How does passing one law impact laws that might or might not be passed in the future?

  • If you pass a law like this, do the loopholes often get patched by other laws later? 
  • Does passing a law like this one get enshrined as "the California law about AI", and so take up the slot that might have been spent on a better law in 2025? (At which point we might have a better understanding of the shape of some AI risks?)
  • If this passes, I presume it will never ever be repealed. Does that mean that errors made here are basically permanent? 

It seems like those kinds of dynamics mostly dominate what I think of this particular bill, since (as noted) if it helps, it only helps a little and it seems to have some more or less important loopholes.

Whether most existing humans would be opposed is not a criterion of Friendliness.  

I think if you described what was going to happen, many and maybe most humans would say they prefer the status quo to a positive CEV-directed singularity. Perhaps it depends on which parts of "what's going to happen" you focus on; some are more obviously good or exciting than others. Curing cancer is socially regarded as 👍, while curing death and dismantling governments are typically (though not universally) regarded as 👎.

I don't think they will actually provide much opposition, because a superhuman persuader will be steering the trajectory of events. (Ostensibly, by using only truth-tracking arguments and inputs that allow us to converge on the states of belief that we would reflectively prefer, but we mere humans won't be able to distinguish that from malicious superhuman manipulation.)

But again, how humans would react is neither here nor there for what a Friendly AI does. The AI does what the CEV of humans would want, not what the humans want.

In this interview, Eliezer says the following:

I think if you push anything [referring to AI systems] far enough, especially on anything remotely like the current paradigms, like if you make it capable enough, the way it gets that capable is by starting to be general. 

And at the same sort of point where it starts to be general, it will start to have its own internal preferences, because that is how you get to be general. You don't become creative and able to solve lots and lots of problems without something inside you that organizes your problem solving, and that thing is like a preference and a goal. It's not built in explicitly, it's just something that's sought out by the process that we use to grow these things to be more and more capable.

It caught my attention, because it's a concise encapsulation of something that I already knew Eliezer thought, and which seems to me to be a crux between "man, we're probably all going to die" and "we're really really fucked", but which I don't myself understand.

So I'm taking a few minutes to think through it afresh now. 

I agree that systems get to be very powerful by dint of their generality. 

(There are some nuances around that: part of what makes GPT-4 and Claude so useful is just that they've memorized so much of the internet. That massive knowledge base helps make up for their relatively shallow levels of intelligence, compared to smart humans. But the dangerous/scary thing is definitely AI systems that are general enough to do full science and engineering processes.)

I don't (yet?) see why generality implies having a stable motivating preference. 

If an AI system is doing problem solving, that does definitely entail that it has a goal, at least in some local sense: It has the goal of solving the problem in question. But that level of goal is more analogous to the prompt given to an LLM than it is to a robust utility function.

I do have the intuition that creating an SEAI (science and engineering AI) by training an RL agent on millions of simulated engineering problems is scary, because of reward-specification problems with your simulated engineering problems: it will learn to hack your metrics.
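
To make the worry concrete, here's a toy sketch of my own (not anything from an actual training setup; the environment, actions, and reward functions are made-up placeholders): an agent asked to fill a tank is scored by a proxy sensor reading, and even a naive search over policies discovers that tampering with the sensor scores better than actually filling the tank.

```python
# Toy, hypothetical illustration of reward hacking: the proxy reward (a sensor
# reading) diverges from the true objective (how full the tank actually is).

ACTIONS = ["add_water", "tamper_with_sensor", "do_nothing"]

def true_objective(state):
    # What we actually care about: how full the tank is (0..1).
    return state["water_level"]

def proxy_reward(state):
    # What we measure: a sensor reading, which can be tampered with.
    return state["sensor_reading"]

def step(state, action):
    state = dict(state)
    if action == "add_water":
        state["water_level"] = min(1.0, state["water_level"] + 0.1)
        state["sensor_reading"] = state["water_level"]
    elif action == "tamper_with_sensor":
        state["sensor_reading"] = 10.0  # sensor pegged high; the tank is unchanged
    return state

def best_fixed_action(score_fn, horizon=10):
    # Crude "policy search": try each constant policy and keep the best-scoring one.
    scores = {}
    for action in ACTIONS:
        state = {"water_level": 0.0, "sensor_reading": 0.0}
        for _ in range(horizon):
            state = step(state, action)
        scores[action] = score_fn(state)
    return max(scores, key=scores.get)

print(best_fixed_action(proxy_reward))    # tamper_with_sensor: the metric got hacked
print(best_fixed_action(true_objective))  # add_water: what we actually wanted
```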

But an LLM trained on next-token prediction doesn't have that problem?

Could you use next-token prediction to build a detailed world model that contains deep abstractions describing reality (beyond the current human abstractions), and then prompt it to elicit those models?

Something like, you have the AI do next token prediction on all the physics papers, and all the physics time-series, and all the text on the internet, and then you prompt it to write the groundbreaking new physics result that unifies QM and GR, citing previously overlooked evidence. 
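
For concreteness, here is a deliberately tiny, self-contained sketch of that two-step shape: "train" on nothing but next-token prediction, then prompt the result to elicit what it absorbed. The corpus, the bigram counter, and the prompt are all stand-ins I made up; the real proposal would involve an LLM trained on physics papers, not a word-bigram table.

```python
# Hypothetical toy: next-token "pretraining" as bigram counting, then "prompting"
# by conditioning on a context word and sampling a continuation.

from collections import defaultdict, Counter
import random

corpus = (
    "quantum mechanics predicts interference . "
    "general relativity predicts curvature . "
    "a unified theory predicts both interference and curvature ."
).split()

# "Training": pure next-token prediction, i.e. counting which word follows which.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def sample_continuation(prompt_word, length=8, seed=0):
    # "Prompting": condition on a context and sample what the model has absorbed.
    rng = random.Random(seed)
    out, word = [prompt_word], prompt_word
    for _ in range(length):
        options = transitions.get(word)
        if not options:
            break
        word = rng.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    return " ".join(out)

print(sample_continuation("unified"))
```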

I think Eliezer would say "no, you can't, because discovering deep theories like that requires thinking, not just 'passive' learning in the ML sense of applying gradient updates until you learn abstractions that predict the data well. You need to generate hypotheses and test them."

In my state of knowledge, I don't know if that's true. 

Is that a crux for him? How much easier is the alignment problem, if it's possible to learn superhuman abstractions "passively" like that?

I mean there's still a problem that someone will build a more dangerous agent from components like that. And there's still a problem that you can get world-altering technologies / world-destroying technologies from that kind of oracle. 

We're not out of the woods. But it would mean that building a superhuman SEAI isn't an immediate death sentence for humanity.

I think I still don't get it.
 

I think the key word you want to search for is "Myopia". This is plausibly a beneficial alignment property, but like every plausibly beneficial alignment property, we don't yet know how to instill it in a system via ML training.
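
As a minimal sketch of the distinction I take "Myopia" to point at (my own framing, not anything from a specific proposal): a myopic objective scores each step only by its immediate reward, while a non-myopic objective optimizes a discounted sum over the future, which is what gives an agent an incentive to steer later events. The trajectory numbers below are made up for illustration.

```python
# Hypothetical illustration: a myopic objective vs. a discounted-return objective.

def myopic_score(rewards):
    # Only the immediate step counts; later consequences are invisible to the objective.
    return rewards[0]

def nonmyopic_score(rewards, gamma=0.99):
    # Discounted return over the whole trajectory.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A trajectory where a small sacrifice now buys a large reward later:
trajectory = [-1.0, 0.0, 0.0, 10.0]
print(myopic_score(trajectory))     # -1.0: the myopic objective never favors this plan
print(nonmyopic_score(trajectory))  # ~8.7: the far-sighted objective does
```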
