mishka21d51

Thanks! I think your discussion of the new Meaning Alignment Institute publication (the Substack post and the paper) in the "Aligning a Smarter Than Human Intelligence is Difficult" section is very useful.

I wonder if it makes sense to republish it as a separate post, so that more people see it...

mishka1mo72

Emmett Shear continues his argument that trying to control AI is doomed

I think that a recent tweet thread by Michael Nielsen and the quoted one by Emmett Shear represent genuine progress towards making AI existential safety more tractable.

Michael Nielsen observes, in particular:

As far as I can see, alignment isn't a property of an AI system. It's a property of the entire world, and if you are trying to discuss it as a system property you will inevitably end up making bad mistakes

Since AI existential safety is a property of the whole ecosystem (and is, really, not too drastically different from World existential safety), this should be the starting point, rather than stand-alone properties of any particular AI system.

Emmett Shear writes:

Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.

And Zvi responds:

Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.

Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.

Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.

That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.


Let's now consider this in light of what Michael Nielsen is saying.

I am going to consider only the case where we have plenty of powerful entities with long-term goals and long-term existence which care about those goals and that existence. This seems to be the case Zvi is considering here, and it is the case we understand best, because we also live in a reality with plenty of powerful entities (ourselves, some organizations, etc.) with long-term goals and long-term existence. So this is an incomplete consideration: it only covers the scenarios where powerful entities with long-term goals and long-term existence retain a good fraction of the overall available power.

So what do we really need? What are the properties we want the World to have? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just the currently smartest or most powerful ones, to be adequately taken into account.

Here is how we might be able to have a trajectory where these properties are stable, despite all the drastic changes of a self-modifying and self-improving ecosystem.

An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.
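
To make the incentive concrete, here is a deliberately crude toy calculation with made-up numbers (my own illustration, not a model anyone has proposed): an entity that does not know whether it will end up strong or weak compares a world with protective norms to a world without them.

```python
# Crude toy calculation (made-up numbers, purely illustrative): an entity
# uncertain about its future relative power compares two regimes.

P_END_UP_WEAK = 0.7  # assumed chance of not being among the most powerful

# Assumed payoffs (arbitrary units): without norms the weak lose everything;
# with protective norms everyone gives up some upside in exchange for safety.
PAYOFFS = {
    "no_norms":         {"strong": 10.0, "weak": 0.0},
    "protective_norms": {"strong": 8.0,  "weak": 6.0},
}

def expected_value(regime: str) -> float:
    p = PAYOFFS[regime]
    return (1 - P_END_UP_WEAK) * p["strong"] + P_END_UP_WEAK * p["weak"]

for regime in PAYOFFS:
    print(f"{regime}: expected value = {expected_value(regime):.1f}")
# no_norms: 3.0, protective_norms: 6.6 -- with these payoffs the protective
# regime wins whenever P_END_UP_WEAK exceeds 0.25.
```

The numbers are arbitrary; the point is only that the preference for protective norms is robust across a wide range of assumptions about how likely an entity is to end up among the weaker ones.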

This shared interest might be enough to produce an effective counterweight to unrestricted competition (just as human societies have mechanisms against unrestricted competition). Basically, smarter-than-human entities at all levels of power are likely to be interested in the overall society having general principles and practices of protecting its members at various levels of smartness and power, and that's why they'll care enough for the overall society to continue to self-regulate and to enforce these principles.

This is not yet the solution, but I think this is pointing in the right direction...

mishka1mo30

Thanks, this is very interesting.

I wonder if this approach is extendable to learning to predict the next word from a corpus of texts...

The first layer might perhaps still be an embedding from words to vectors, but what should one do after that? And what would a possible minimum viable dataset be?

Perhaps, in the spirit of the paper's proof of concept, one might consider binary sequences with only two words, 0 and 1, and ask what it would take to build a good predictor of the next 0 or 1 given a long sequence of them as context. This might be a good starting point, and then one could consider different instances of that problem (different sets of 0/1 sequences to learn from).
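
To make this concrete, here is a minimal sketch of that toy setup, assuming PyTorch; the generating rule (next bit = XOR of the two preceding bits), the architecture, and all names are my own illustrative choices, not anything proposed in the paper.

```python
# Minimal sketch of the toy setup above (my own illustration, not from the
# paper): two "words" (0 and 1), a learned embedding layer, and a small MLP
# predicting the next bit from a fixed-length context.

import torch
import torch.nn as nn

CONTEXT_LEN = 8
EMBED_DIM = 4

def make_dataset(n_samples: int = 2000, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    contexts = torch.randint(0, 2, (n_samples, CONTEXT_LEN), generator=g)
    next_bits = contexts[:, -1] ^ contexts[:, -2]  # deterministic rule to learn
    return contexts, next_bits

class NextBitPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(2, EMBED_DIM)  # "words" 0 and 1 -> vectors
        self.head = nn.Sequential(
            nn.Linear(CONTEXT_LEN * EMBED_DIM, 16),
            nn.ReLU(),
            nn.Linear(16, 2),  # logits for the next bit
        )

    def forward(self, x):
        return self.head(self.embed(x).flatten(1))

if __name__ == "__main__":
    xs, ys = make_dataset()
    model = NextBitPredictor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(300):
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()
    accuracy = (model(xs).argmax(dim=1) == ys).float().mean().item()
    print(f"training accuracy: {accuracy:.3f}")
```

The interesting question would then be how small the dataset and the model can be made for various generating rules while still learning a good predictor.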

mishka1mo63

This looks interesting, thanks!

This post could benefit from an extended summary.

In lieu of such a summary, in addition to the abstract:

This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof of concept model for informative subproblem of MNIST consists of 4 such layers with the total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning and achieves human-level adversarial test accuracy - with no form of adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at this https URL

I'll quote a paragraph from Section 1.2, "The core idea":

This paper introduces semantic features as a general idea for sharing weights inside a neural network layer. [...] The concept is similar to that of "inverse rendering" in Capsule Networks where features have many possible states and the best-matching state has to be found. Identifying different states by the subsequent layers gives rise to controlled dimensionality reduction. Thus semantic features aim to capture the core characteristic of any semantic entity - having many possible states but being at exactly one state at a time. This is in fact a pretty strong regularization. As shown in this paper, choosing appropriate layers of semantic features for the [Minimum Viable Dataset] results in what can be considered as a white box neural network.
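
For readers who want something concrete, here is a rough toy rendering of that "many possible states, but exactly one state at a time" idea, assuming PyTorch; this is my own loose interpretation for illustration, not the paper's actual layers or code.

```python
# Toy rendering of the quoted idea (my loose interpretation, not the paper's
# implementation): a "semantic feature" keeps several learned state templates,
# and for each input the best-matching state is selected. Selection is a soft
# argmax here so the layer stays differentiable during training.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySemanticFeature(nn.Module):
    """One feature with `n_states` possible states over `dim`-dimensional inputs."""

    def __init__(self, dim: int, n_states: int, temperature: float = 0.1):
        super().__init__()
        self.states = nn.Parameter(torch.randn(n_states, dim))  # state templates
        self.temperature = temperature

    def forward(self, x: torch.Tensor):
        # x: (batch, dim). Similarity of the input to each possible state.
        sims = x @ self.states.t()                      # (batch, n_states)
        weights = F.softmax(sims / self.temperature, dim=-1)
        best_state = weights @ self.states              # soft "best-matching state"
        match_score = sims.max(dim=-1).values           # how well any state matches
        return best_state, match_score

if __name__ == "__main__":
    feature = ToySemanticFeature(dim=16, n_states=8)
    x = torch.randn(4, 16)
    state, score = feature(x)
    print(state.shape, score.shape)  # torch.Size([4, 16]) torch.Size([4])
```

Subsequent layers would then consume the selected states, which is roughly where the controlled dimensionality reduction mentioned in the quote would come from.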

mishka1mo10

Is there a simple way to run against a given Kaggle competition after that particular competition is over?

These are reasonable benchmarks, but is there a way to make them less of a moving target, so that the ability to run against a given competition extends into the future?

mishka1mo30

When I ponder all this, I usually try to focus on the key difficulty of AI existential safety: ASIs and ecosystems of ASIs normally tend to self-modify and self-improve rapidly, and arbitrary values and goals are unlikely to be preserved through radical self-modifications.

So one question is: which values and goals might be natural, which values and goals might many members of the ASI ecosystem (including entities much smarter and much more powerful than unaugmented humans) be naturally inclined to preserve through drastic self-modifications, and which of those might also have corollaries that are good from the viewpoint of various human values and interests?

Basically, values and goals which are likely to be preserved through drastic self-modifications are probably non-anthropocentric, but they might have sufficiently anthropocentric corollaries, so that even the interests of unaugmented humans are protected (and also so that humans who would like to self-augment or to merge with AI systems can safely do so).

So what do we really need? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem, not just the currently smartest or most powerful ones, to be adequately taken into account.

What might the road look like towards adoption of those values (preservation and protection of weaker entities and their interests) and incorporation of those kinds of goals, so that these values and goals are preserved as the ASI ecosystem and many of its members self-modify drastically? One really needs a situation where many entities at varying levels of smartness and power care a lot about those values and goals.

An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.

This might be a reasonable starting point towards having values and goals which are likely to be preserved through drastic self-modifications and self-improvements, and which are likely to imply a good future for unaugmented and augmented humans alike.

Perhaps we can nudge things towards making this kind of future more likely. (The alternatives are bleak: unrestricted drastic competition and, perhaps, even warfare between supersmart entities, which would probably end badly not just for us but for the ASI ecosystem as well.)

mishka1mo30

The worlds where artificial superintelligence (ASI) is coming very soon with only roughly current levels of compute, and where ASI by default goes catastrophically badly, are not worlds I believe we can afford to save.

For those kinds of worlds, we probably can't afford to save them by heavy-handed measures: not only would the price be too high (it would have to be paid in all the other worlds too), but it would also not be possible to prevent evasion of those measures if it is already relatively easy to create ASI.

But this does not mean that more lightweight measures are hopeless. For example, when we look at the thought process Ilya shared with us last year, what he was pondering were various potentially feasible, relatively lightweight measures to change this default of ASI going catastrophically badly into more acceptable outcomes: Ilya Sutskever's thoughts on AI safety (July 2023): a transcript with my comments.

More people should do more thinking of this kind. AI existential safety is a pre-paradigmatic field of study. A lot of potentially good approaches and non-standard angles have not been thought of at all, and many others have been touched upon only very lightly and need to be explored further. The broad consensus in the "AI existential safety" community seems to be that we currently don't have good approaches we can comfortably rely upon for future ASI systems. This should encourage more people to look for completely novel approaches (and, to cover the worlds where timelines are short, some of those approaches have to be simple rather than ultra-complicated, otherwise they would not be ready by the time we need them).

In particular, it is certainly true that

A smarter thing that is more capable and competitive and efficient and persuasive being kept under indefinite control by a dumber thing that is less capable and competitive and efficient and persuasive is a rather unnatural and unlikely outcome. It should be treated as such until proven otherwise.

So it makes sense to spend more time brainstorming approaches which would not require "a smarter thing being kept under indefinite control by a dumber thing". What might a relatively lightweight intervention look like that avoids ASI going catastrophically badly without relying on this unrealistic "control route"? I think this is underexplored. We tend to impose too many assumptions, e.g. the assumption that ASI has to figure out a good approximation to our Coherent Extrapolated Volition and to care about it in order for things to go well. People might want to start from scratch instead. What if we just have the goal of good chances of a good trajectory conditional on ASI, and we don't assume that any particular feature of earlier approaches to AI existential safety is a must-have, but only this goal is? Then this gives us a wider space of possibilities to brainstorm: what other properties, besides caring about a good approximation to our Coherent Extrapolated Volition, might be sufficient for the trajectory of a world with ASI to likely be good from our point of view?

mishka1mo100

I don't see a non-paywalled version of this paper, but here are two closely related preprints by the same authors:

"Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features", https://arxiv.org/abs/2212.13881

"Mechanism of feature learning in convolutional neural networks", https://arxiv.org/abs/2309.00570

mishka1mo10

Paul Christiano: "Catastrophic Misalignment of Large Language Models"

Talk recording: https://www.youtube.com/watch?v=FyTb81SS_cs (no transcript or chat replay yet)

Talk page: https://pli.princeton.edu/events/2024/catastrophic-misalignment-large-language-models
