Comments

mishka · 4h

I think having a probability distribution over timelines is the correct approach. Like, in the comment above:

> I think I'm more likely to be better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting", whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).

mishka · 6h

> However, none of them talk about each other, and presumably at most one of them can be meaningfully right?

Why can at most one of them be meaningfully right?

Would not a simulation typically be "a multi-player game"?

(But yes, if they assume that their "original self" was the sole creator (?), then they would all be some kind of "clones" of that particular "original self". Which would surely increase the overall weirdness.)

mishka · 2d

The "AI #61: Meta Trouble" has not been cross-posted to LessWrong, but here is the link to the original post: https://thezvi.wordpress.com/2024/04/25/ai-61-meta-trouble/

mishka · 1mo

Thanks! I think your discussion of the new Meaning Alignment Institute publication (the Substack post and the paper) in the "Aligning a Smarter Than Human Intelligence is Difficult" section is very useful.

I wonder if it makes sense to republish it as a separate post, so that more people see it...

mishka · 1mo

> Emmett Shear continues his argument that trying to control AI is doomed

I think that a recent tweet thread by Michael Nielsen and the quoted one by Emmett Shear represent genuine progress towards making AI existential safety more tractable.

Michael Nielsen observes, in particular:

> As far as I can see, alignment isn't a property of an AI system. It's a property of the entire world, and if you are trying to discuss it as a system property you will inevitably end up making bad mistakes

Since AI existential safety is a property of the whole ecosystem (and is, really, not too drastically different from World existential safety), this should be the starting point, rather than stand-alone properties of any particular AI system.

Emmett Shear writes:

> Hopefully you’ve validated whatever your approach is, but only one of these is stable long term: care. Because care can be made stable under reflection, people are careful (not a coincidence, haha) when it comes to decisions that might impact those they care about.

And Zvi responds:

> Technically I would say: Powerful entities generally caring about X tends not to be a stable equilibrium, even if it is stable ‘on reflection’ within a given entity. It will only hold if caring more about X provides a competitive advantage against other similarly powerful entities, or if there can never be a variation in X-caring levels between such entities that arises other than through reflection, and also reflection never causes reductions in X-caring despite this being competitively advantageous. Also note that variation in what else you care about to what extent is effectively variation in X-caring.
>
> Or more bluntly: The ones that don’t care, or care less, outcompete the ones that care.
>
> Even the best case scenarios here, when they play out the ways we would hope, do not seem all that hopeful.
>
> That all, of course, sets aside the question of whether we could get this ‘caring’ thing to operationally work in the first place. That seems very hard.


Let's now consider this in light of what Michael Nielsen is saying.

I am going to consider only the case where we have plenty of powerful entities with long-term goals and long-term existence which care about those goals and about their continued existence. This seems to be the case Zvi is considering here, and it is the case we understand best, because we also live in a reality with plenty of powerful entities (ourselves, some organizations, etc.) with long-term goals and long-term existence. So this is an incomplete consideration: it only covers scenarios where powerful entities with long-term goals and long-term existence retain a good fraction of the overall available power.

So what do we really need? What are the properties we want the World to have? We need a good deal of conservation and non-destruction, and we need the interests of the weaker members of the overall ecosystem (not just the currently smartest or most powerful ones) to be adequately taken into account.

Here is how we might be able to get a trajectory where these properties stay stable, despite all the drastic changes of a self-modifying and self-improving ecosystem.

An arbitrary ASI entity (just like an unaugmented human) cannot fully predict the future. In particular, it does not know where it might eventually end up in terms of relative smartness or relative power (relative to the most powerful ASI entities or to the ASI ecosystem as a whole). So if any given entity wants to be long-term safe, it is strongly interested in the ASI society having general principles and practices of protecting its members on various levels of smartness and power. If only the smartest and most powerful are protected, then no entity is long-term safe on the individual level.

This might be enough to produce an effective counterweight to unrestricted competition (just as human societies have mechanisms against unrestricted competition). Basically, smarter-than-human entities at all levels of power are likely to be interested in the overall society having general principles and practices of protecting its members at various levels of smartness and power, and that's why they'll care enough for the overall society to continue to self-regulate and to enforce these principles.
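To make the expected-value intuition behind this more concrete, here is a deliberately minimal toy model (my own sketch, not something from Zvi's or Nielsen's posts; the number of power levels and the payoff values are arbitrary assumptions): an agent that cannot predict its eventual power rank compares its expected outcome under a "protect everyone" norm and an "only the strongest are safe" norm.

```python
# Toy model: an agent with uniform uncertainty over its future power rank
# compares two societal norms. Levels, cutoff, and payoffs are arbitrary.
NUM_LEVELS = 10        # possible future power ranks, 0 = weakest, 9 = strongest
TOP_ONLY_CUTOFF = 8    # under the "top only" norm, only ranks 8 and 9 are safe
SAFE, UNSAFE = 1.0, 0.0

def expected_payoff(norm):
    # Average over all equally likely future ranks.
    payoffs = []
    for rank in range(NUM_LEVELS):
        if norm == "protect_all" or rank >= TOP_ONLY_CUTOFF:
            payoffs.append(SAFE)
        else:
            payoffs.append(UNSAFE)
    return sum(payoffs) / NUM_LEVELS

print("protect_all:", expected_payoff("protect_all"))  # 1.0
print("top_only:   ", expected_payoff("top_only"))     # 0.2
```

The point is only that, under genuine uncertainty about one's future rank, supporting universal protection norms dominates in expectation; it says nothing yet about whether such norms can actually be maintained and enforced.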

This is not yet the solution, but I think this is pointing in the right direction...

mishka · 1mo

Thanks, this is very interesting.

I wonder if this approach is extendable to learning to predict the next word from a corpus of texts...

The first layer might perhaps still be an embedding from words to vectors, but what should one do then? What would be a possible minimum viable dataset?

Perhaps, in the spirit of the paper's proof of concept, one might consider binary sequences of 0s and 1s, so there are only two words, 0 and 1, and ask what it would take to have a good predictor of the next 0 or 1 given a long sequence of those as context. This might be a good starting point, and then one might consider different instances of that problem (different sets of 0/1 sequences to learn from).
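For concreteness, here is one shape such a toy setup could take. This is purely my own sketch, not anything from the paper: the noisy-XOR data source and the count-based k-gram predictor are arbitrary choices, just to make the "predict the next 0 or 1" problem runnable end to end.

```python
# Toy "next bit" prediction: generate 0/1 sequences from a simple noisy rule
# and fit a count-based k-gram predictor as a baseline.
import random
from collections import Counter, defaultdict

def generate_sequence(length, flip_prob=0.1, seed=0):
    """Order-2 rule: next bit = XOR of the previous two bits, flipped with small probability."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1), rng.randint(0, 1)]
    for _ in range(length - 2):
        bit = seq[-1] ^ seq[-2]
        if rng.random() < flip_prob:
            bit ^= 1
        seq.append(bit)
    return seq

def train_kgram(seq, k=2):
    """Count next-bit frequencies for every length-k context."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        context = tuple(seq[i:i + k])
        counts[context][seq[i + k]] += 1
    return counts

def predict(counts, context):
    """Most frequent continuation of the context; fall back to 0 for unseen contexts."""
    c = counts.get(tuple(context))
    return c.most_common(1)[0][0] if c else 0

train, test = generate_sequence(10_000), generate_sequence(2_000, seed=1)
model = train_kgram(train, k=2)
hits = sum(predict(model, test[i:i + 2]) == test[i + 2] for i in range(len(test) - 2))
print(f"next-bit accuracy: {hits / (len(test) - 2):.3f}")
```

A more interesting version of the exercise would swap in data sources that require longer contexts, and ask what kind of layered model recovers them.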

mishka · 1mo

This looks interesting, thanks!

This post could benefit from an extended summary.

In lieu of such a summary, and in addition to the abstract:

> This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof of concept model for informative subproblem of MNIST consists of 4 such layers with the total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning and achieves human-level adversarial test accuracy - with no form of adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at this https URL

I'll quote a paragraph from Section 1.2, "The core idea":

> This paper introduces semantic features as a general idea for sharing weights inside a neural network layer. [...] The concept is similar to that of "inverse rendering" in Capsule Networks where features have many possible states and the best-matching state has to be found. Identifying different states by the subsequent layers gives rise to controlled dimensionality reduction. Thus semantic features aim to capture the core characteristic of any semantic entity - having many possible states but being at exactly one state at a time. This is in fact a pretty strong regularization. As shown in this paper, choosing appropriate layers of semantic features for the [Minimum Viable Dataset] results in what can be considered as a white box neural network.
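To make the "many possible states, but exactly one state at a time" idea concrete, here is how I personally picture it in code. This is a toy interpretation of mine, not the paper's actual architecture or its published code: each feature holds a small bank of state templates, and the forward pass keeps only the best-matching state.

```python
# Toy "semantic feature": a bank of state templates sharing the same input,
# with only the best-matching state kept per input.
import numpy as np

rng = np.random.default_rng(0)

class ToySemanticFeature:
    def __init__(self, input_dim, num_states):
        # One weight vector per possible state of the feature.
        self.templates = rng.normal(size=(num_states, input_dim))

    def forward(self, x):
        # Match the input against every state and keep only the winner,
        # so the feature is "in exactly one state" for this input.
        scores = self.templates @ x
        best = int(np.argmax(scores))
        return best, float(scores[best])

feature = ToySemanticFeature(input_dim=16, num_states=4)
x = rng.normal(size=16)
state, score = feature.forward(x)
print(f"feature is in state {state} with match score {score:.2f}")
```

Presumably the real model makes the state selection learnable and differentiable; the hard argmax here is only meant to convey the "exactly one state" constraint.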

mishka · 1mo

Is there a simple way to run against a given Kaggle competition after that particular competition is over?

These are reasonable benchmarks, but is there a way to make them less of a moving target, so that the ability to run against a given competition extends into the future?
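For what it's worth, one partial workaround I'm aware of: many finished competitions still accept "late submissions", which get scored but don't count toward the leaderboard. Here is a sketch using the official Kaggle CLI, assuming it is installed and authenticated and that the competition in question allows late submissions; the competition slug and file names below are placeholders.

```python
# Sketch: download a finished competition's (frozen) data and make a late
# submission to get an official score. Assumes the Kaggle CLI is installed
# and authenticated; slug and file names are placeholders.
import subprocess

COMPETITION = "some-finished-competition"  # placeholder slug

# Download the competition data.
subprocess.run(
    ["kaggle", "competitions", "download", "-c", COMPETITION, "-p", "data/"],
    check=True,
)

# Submit predictions as a late submission.
subprocess.run(
    ["kaggle", "competitions", "submit", "-c", COMPETITION,
     "-f", "submission.csv", "-m", "post-deadline benchmark run"],
    check=True,
)
```

This still leaves the moving-target problem for competitions that disable late submissions entirely, which is the part I would most like to see addressed.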
