mishka10

Nobody currently knows how to align strongly superhumanly smart AIs to human interests, and we need way more time to solve this problem. Making incremental progress on AI capabilities is shortening the timeline we have left to figure out how to align AI and is thus making human extinction more likely. Thus by far the best action is to stop advancing AI capabilities.

It seems that not much research has been done on the invariant properties of rapidly self-modifying ecosystems. At least, when I did some searching and also asked here a few months ago, not much came up: https://www.lesswrong.com/posts/sDapsTwvcDvoHe7ga/what-is-known-about-invariants-in-self-modifying-systems.

It's not possible to get a handle on the dynamics of rapidly self-modifying ecosystems without a better understanding of how to think about properties conserved during self-modification. And ecosystems with rapidly increasing capabilities will be strongly self-modifying.

However, any progress in this direction is likely to be dual-use. Knowing how to think about self-modification invariants is very important for AI existential safety and is also likely to be a strong capability booster.

This is a very typical conundrum for AI existential safety. We can try to push harder to make sure that research into invariant properties of self-modifying (eco)systems becomes an active research area again, but the likely side effect of better understanding the properties of potentially fooming systems is making it easier to bring such systems into existence. And we don't have a good understanding of the proper ways to handle this kind of situation (although the topic of dual use is discussed here from time to time).

mishka54

No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.

(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).

The important thing is that they are uncertain about timelines themselves: in part because no one knows how perplexity translates into capabilities; in part because capabilities might differ even at the same perplexity if the underlying architectures are different (e.g. in-context learning might depend on the architecture even at a fixed perplexity, and we have seen a stream of potentially very interesting architectural innovations recently); in part because it's not clear how big the potential of "harness"/"scaffolding" is; and so on.

This does not mean there is no political infighting. But it's on the background of them being correctly uncertain about true timelines...
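To make the "resulting single probability distribution" idea concrete, here is a minimal sketch (my own illustration with made-up numbers and equal weights; nothing here reflects anyone's actual estimates) of linear opinion pooling, where each leader's timeline distribution is averaged into one organization-level distribution:

```python
import numpy as np

years = np.arange(2025, 2046)  # hypothetical candidate "arrival" years

# Hypothetical individual timeline distributions (unnormalized Gaussians)
person_a = np.exp(-0.5 * ((years - 2029) / 2.0) ** 2)
person_b = np.exp(-0.5 * ((years - 2035) / 4.0) ** 2)
person_c = np.exp(-0.5 * ((years - 2031) / 3.0) ** 2)

def normalize(p):
    return p / p.sum()

# Equal-weight linear pool: one organization-level distribution
pooled = normalize(sum(normalize(p) for p in (person_a, person_b, person_c)))

median_year = years[np.searchsorted(np.cumsum(pooled), 0.5)]
print("pooled median year:", median_year)
```

The point of the sketch is only that the pooled distribution stays spread out when the individual ones disagree, which is one way an organization ends up behaving "as if" it had a single, uncertain timeline.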


Compute-wise, inference demands are huge and growing with the popularity of the models (look at how much Facebook did to make Llama 3 more inference-efficient).

So if they expect the models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people as they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of that expansion might vary).

mishka54

I think having a probability distribution over timelines is the correct approach. Like, in the comment above:

I think I'm more likely to be better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting", whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).

mishka32

However, none of them talk about each other, and presumably at most one of them can be meaningfully right?

Why can at most one of them be meaningfully right?

Would not a simulation typically be "a multi-player game"?

(But yes, if they assume that their "original self" was the sole creator (?), then they would all be some kind of "clones" of that particular "original self". Which would surely increase the overall weirdness.)

mishka20

The "AI #61: Meta Trouble" has not been cross-posted to LessWrong, but here is the link to the original post: https://thezvi.wordpress.com/2024/04/25/ai-61-meta-trouble/

mishka51

Thanks! I think your discussion of the new Meaning Alignment Institute publication (the substack post and the paper) in the "Aligning a Smarter Than Human Intelligence is Difficult" section is very useful.

I wonder if it makes sense to republish it as a separate post, so that more people see it...

mishka72

Emmett Shear continues his argument that trying to control AI is doomed

I think that a recent tweet thread by Michael Nielsen and the quoted one by Emmett Shear represent genuine progress towards making AI existential safety more tractable.

Michael Nielsen observes, in particular:

As far as I can see, alignment isn't a property of an AI system. It's a property of the entire world, and if you are trying to discuss it as a system property you will inevitably end up making bad mistakes

Since AI existential safety is a property of the whole ecosystem (and is, really, not too drastically different from World existential safety), this should be the starting point, rather than stand-alone properties of any particular AI system.

Emmett Shear writes:

Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.

And Zvi responds

Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.

Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.

Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.

That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.


Let's now consider this in light of what Michael Nielsen is saying.

I am going to consider only the case where we have plenty of powerful entities with long-term goals and long-term existence which care about those goals and that existence. This seems to be the case Zvi is considering here, and it is the case we understand best, because we also live in a reality with plenty of powerful entities (ourselves, some organizations, etc.) with long-term goals and long-term existence. So this is an incomplete consideration: it only includes the scenarios where powerful entities with long-term goals and long-term existence retain a good fraction of the overall available power.

So what do we really need? What are the properties we want the World to have? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just the currently smartest or most powerful ones, to be adequately taken into account.

Here is how we might be able to have a trajectory where these properties are stable, despite all the drastic changes in the self-modifying and self-improving ecosystem.

An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.

This might be enough to produce an effective counterweight to unrestricted competition (just as human societies have mechanisms against unrestricted competition). Basically, smarter-than-human entities at all levels of power are likely to be interested in the overall society having general principles and practices that protect its members at various levels of smartness and power, and that's why they'll care enough for the overall society to continue to self-regulate and to enforce these principles.
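As a toy illustration of this "veil of ignorance" argument (my own sketch with made-up payoffs, not a formal model from anyone's post), one can compare the expected value of the two regimes for an agent that is uncertain about its future rank:

```python
# An agent uncertain about its future power rank compares a regime that
# protects only the strongest with a regime that protects members at
# every level. All numbers below are hypothetical.

N_RANKS = 10                 # hypothetical number of power levels
P_TOP = 1.0 / N_RANKS        # agent's chance of ending up at the very top

U_PROTECTED = 0.8            # payoff under universal protection (protection has some cost)
U_TOP_ONLY = 1.0             # payoff for the top entity when only the strongest is safe
U_UNPROTECTED = 0.0          # payoff for everyone else in that regime

ev_top_only = P_TOP * U_TOP_ONLY + (1 - P_TOP) * U_UNPROTECTED
ev_universal = U_PROTECTED

print(f"expected value, 'only the strongest are safe': {ev_top_only:.2f}")
print(f"expected value, 'everyone is protected':       {ev_universal:.2f}")
# With these made-up numbers, any agent that cannot be sure of staying on
# top prefers the regime with general protection norms.
```

The conclusion is sensitive to the assumed payoffs, of course; the sketch only shows the shape of the incentive, not that it will dominate in practice.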

This is not yet the solution, but I think this is pointing in the right direction...

mishka30

Thanks, this is very interesting.

I wonder if this approach is extendable to learning to predict the next word from a corpus of texts...

The first layer might perhaps still be an embedding from words to vectors, but what should one do then? What would be a possible minimum viable dataset?

Perhaps, in the spirit of the paper's PoC, one might consider binary sequences of 0s and 1s, with only two words, 0 and 1, and ask what it would take to build a good predictor of the next 0 or 1 given a long sequence of them as context. This might be a good starting point, and then one might consider different instances of that problem (different examples of (sets of) sequences of 0s and 1s to learn from).
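Here is a minimal sketch of that binary setup (my own toy construction, not the paper's code): the data-generating rule, the noise level, and the n-gram baseline are all arbitrary choices, just to make the "minimum viable dataset" idea concrete.

```python
import random
from collections import defaultdict

def make_sequence(length=1000, seed=0):
    """Hypothetical data-generating rule: next bit is the XOR of the two
    previous bits, flipped with 10% noise."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1), rng.randint(0, 1)]
    for _ in range(length - 2):
        bit = seq[-1] ^ seq[-2]
        if rng.random() < 0.1:
            bit ^= 1
        seq.append(bit)
    return seq

def ngram_predictor(seq, context_len=2):
    """Count continuations of each length-`context_len` context and
    predict the majority continuation."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(context_len, len(seq)):
        ctx = tuple(seq[i - context_len:i])
        counts[ctx][seq[i]] += 1
    return lambda ctx: 0 if counts[tuple(ctx)][0] >= counts[tuple(ctx)][1] else 1

train, test = make_sequence(seed=0), make_sequence(seed=1)
predict = ngram_predictor(train)
hits = sum(predict(test[i - 2:i]) == test[i] for i in range(2, len(test)))
print(f"next-bit accuracy: {hits / (len(test) - 2):.2%}")  # ~90% for this rule
```

A simple count-based baseline like this pins down what "good prediction" means for a given rule (here, roughly one minus the noise rate), which any more interesting learner on the same dataset would need to match or beat.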
