mesaoptimizer

https://mesaoptimizer.com

Comments

I'm really glad you wrote this post, because Tsvi's post is different and touches on very different concepts! That post is mainly about how fun and exploration are undervalued parts of being human. Your post seems to have one goal: to ensure that up-and-coming alignment researchers do not burn themselves out or hyperfocus on only one strategy for helping reduce AI extinction risk.

Note, this passage seems to be a bit... off to me.

This one is slightly different from the last because it is an injunction to take care of your mental health. You are more useful to us when you are not stressed. I won’t deny that you are personally responsible for the entire destiny of the universe, because I won’t lie to you: but we have no use for a broken tool.

People aren't broken tools. People have limited agency, and claiming they are "personally responsible for the entire destiny of the universe" is misleading. One must have an accurate sense of the agency and influence they have when it comes to reducing extinction risk if they want to be useful.

The notion that alignment researchers and people supporting them are "heroes" is a beautiful and intoxicating fantasy. One must be careful that it doesn't lead to corruption in our epistemics, just because we want to maintain our belief in this narrative.

Good point! I won't use Substack though, so if I read your post 24 hours after release I'll leave the typos be.

Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.

I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.

Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.

I stated it in the comment you replied to:

Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.

Natural abstractions are also leaky abstractions.

No, the way I used the term was to point to robust abstractions to ontological concepts. Here's an example: Say . here obviously means 2 in our language, but it doesn't change what represents, ontologically. If , then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world model capabilities.

Math is a robust abstraction. "Natural abstractions", as I use the term, points to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more towards representing the objects in question with these abstractions.

Meaning that even *if* AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery’s functional components with connected physical surroundings.

That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.

This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).

Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.

Typos report:

"Rethink Priors is remote hiring a Compute Governance Researcher [...]" I checked and they still use the name Rethink Priorities.

"33BB LLM on a single 244GB GPU fully lossless" -> should be 33B and 24GB

"AlpahDev from DeepMind [...]" -> should be AlphaDev

Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable, nor do I think it makes sense for humanity to keep a leash on something superintelligent whose preferences can shift.

Assuming an inner-aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstraction Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] give this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.

I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.

So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper:

Delegated control agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.

Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies.


[1] For certain definitions of "simply".

Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.

I am confused: how can this be "information loss" when we are assuming that due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? Because 5 of the data points "didn't matter", that shouldn't count as "information loss" but more like "redundant data, ergo no information transmitted".
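
A minimal sketch of the "redundant data" reading, assuming a linear model in which each data point contributes one linear constraint on the parameters. The numbers and the helper `rank` are made up for illustration and are not from the post. Adding data points that are linear combinations of earlier ones leaves the rank of the constraint matrix unchanged, so the number of free (equal-loss) directions stays the same:

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix, via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        # Look for a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        # Eliminate this column from every row below the pivot row.
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found


# Hypothetical setup: 3 parameters, 2 linearly independent data points.
n_params = 3
independent = [[1, 0, 2], [0, 1, 1]]

# Two extra data points that are linear combinations of the first two.
redundant = independent + [[1, 1, 3], [2, 0, 4]]

free_dims_before = n_params - rank(independent)
free_dims_after = n_params - rank(redundant)
print(free_dims_before, free_dims_after)
```

In this toy case the extra points impose no new constraints, which is why "no information transmitted" seems like the more natural description than "information loss".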