mesaoptimizer

https://mesaoptimizer.com

learn math or hardware

Comments

I think Twitter systematically underpromotes tweets with links external to the Twitter platform, so reposting isn't a viable strategy.

Thanks for the link. I believe I read it a while ago, but it is useful to reread it from my current perspective.

trying to ensure that AIs will be philosophically competent

I think such scenarios are plausible: I know some people argue that certain decision theory problems cannot be safely delegated to AI systems, but if we as humans can work on these problems safely, I expect that we could probably build systems that are about as safe (by crippling their ability to establish subjunctive dependence) but are also significantly more competent at philosophical progress than we are.

Leopold's interview with Dwarkesh is a very useful source of what's going on in his mind.

What happened to his concerns over safety, I wonder?

He doesn't believe in a 'sharp left turn', which means he doesn't consider general intelligence to be a discontinuous (latent) capability spike after which alignment becomes significantly more difficult. To him, alignment is simply a somewhat harder empirical-techniques problem, much like capabilities work. I assume he imagines behavior similar to that of current RLHF-ed models even after frontier labs have doubled or quadrupled the OOMs of optimization power applied to the creation of SOTA models.

He models (incrementalist) alignment research as "dual use", and therefore effectively treats capabilities and alignment as the same measure.

He also expects humans to continue to exist once certain communities of humans achieve ASI, and imagines that the future will be 'wild'. This is a very rare and strange model to have.

He is quite hawkish -- he is incredibly focused on preventing China from stealing AGI capabilities, and believes that private labs will be too incompetent to defend against Chinese infiltration. He would prefer that the US government take over AGI development so that it can race effectively against China.

His model for take-off relies quite heavily on "trust the trendline", estimating linear intelligence increases with more OOMs of optimization power (linear with respect to human intelligence growth from childhood to adulthood). It's not the best way to extrapolate what will happen, but it is a sensible concrete model he can use to talk to normal people and sound confident rather than vague -- a key skill if you are an investor, and an especially key skill for someone trying to make it in the SF scene. (Note that he clearly states in the interview that he's describing his modal model of how things will go, and that he does have uncertainty over how things will occur, but he wants to be concrete about his modal expectation.)

He has claimed that running a VC firm means he can essentially run it as a "think tank" too, focused on better modeling (and perhaps influencing) the AGI ecosystem. Given his desire for a hyper-militarization of AGI research, it makes sense that he'd try to steer things in this direction using the money and influence he will have and build as the founder of an investment firm.

So in summary, he isn't concerned about safety because he prices it in as something about as difficult as (or slightly more difficult than) capabilities work. This puts him in an ideal epistemic position to run a VC firm for AGI labs, since his optimism is exactly what persuades investors to give him money in the expectation that he will return them a profit.

Oh, by that I meant something like "yeah I really think it is not a good idea to focus on an AI arms race". See also Slack matters more than any outcome.

If Company A is 12 months from building Cthulhu, we fucked up upstream. Also, I don't understand why you'd want to play the AI arms race -- you have better options. They expect an AI arms race. Use other tactics. Get into their OODA loop.

Unsee the frontier lab.

These are pretty sane takes (conditional on my model of Thomas Kwa of course), and I don't understand why people have downvoted this comment. Here's an attempt to unravel my thoughts and potential disagreements with your claims.

AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most peoples’ impact will be more than 5 years away.

I think safety work gets less and less valuable at crunch time actually. I think you have this Paul Christiano-like model of getting a prototypical AGI and dissecting it and figuring out how it works -- I think it is unlikely that any individual frontier lab would perceive itself to have the slack to do so. Any potential "dissection" tools will need to be developed beforehand, such as scalable interpretability tools (SAEs seem like rudimentary examples of this). The problem with "prosaic alignment" IMO is that a lot of this relies on a significant amount of schlep -- a lot of empirical work, a lot of fucking around. That's probably why, according to the MATS team, frontier labs have a high demand for "iterators" -- their strategy involves having a lot of ideas about stuff that might work, and without a theoretical framework underlying their search path, a lot of things they do would look like trying things out.

I expect that once you get AI researcher level systems, the die is cast. Whatever prosaic alignment and control measures you've figured out, you'll now be using them in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem that is also modifying itself to improve its capabilities (because that is the point of AI researchers). (Sure, it's easier to test for capability improvements. That doesn't mean the AI systems can't embed information into these proposals such that the modified models end up changed in ways the humans did not anticipate or would not want if they had a full understanding of what is going on.)

Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can’t.

Yeah -- I think most "random AI jobs" are significantly worse for trying to do useful work in comparison to just doing things by yourself or with some other independent ML researchers. If you aren't in a position to do this, however, it does make sense to optimize for a convenient low-cognitive-effort set of tasks that provides you the social, financial and/or structural support that will benefit you, and perhaps look into AI safety stuff as a hobby.

I agree that mentorship is a fundamental bottleneck to building mature alignment researchers. This is unfortunate, but it is the reality we have.

Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I’m not sure of it, so someone should correct me if I’m wrong.

Yeah, post-FTX, I believe that funding is limited enough that you have to be consciously optimizing for getting funding (as an EA-affiliated organization, or as an independent alignment researcher). Particularly for new conceptual alignment researchers, I expect that funding is drastically limited since funding organizations seem to explicitly prioritize funding grantees who will work on OpenPhil-endorsed (or to a certain extent, existing but not necessarily OpenPhil-endorsed) agendas. This includes stuff like evals.

The relative impact of working on capabilities is smaller than working on alignment—there are still 10x as many people doing capabilities as alignment, so unless returns don’t diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and gain 10x.

This is a very Paul Christiano-like argument -- yeah sure the math makes sense, but I feel averse to agreeing with this because it seems like you may be abstracting away significant parts of reality and throwing away valuable information we already have.

Anyway, yeah, I agree with your sentiment. It seems fine to work on non-SOTA AI / ML / LLM stuff, and I'd want people to do so and live a good life. I'd rather they didn't throw themselves into the gauntlet of "AI safety" and get chewed up and spat out by an incompetent ecosystem.

Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.

I still don't understand what causal model would produce this prediction. Here's mine: the number of safety researchers the current SOTA lab ecosystem can absorb is bottlenecked by the labs' expectations for how many researchers they want or need. On one hand, more schlep during the pre-AI-researcher era means more hires. On the other hand, more hires require more research managers or more managerial experience. Anecdotally, many AI capabilities and alignment organizations (both in the EA space and in the frontier lab space) seem to have been historically bottlenecked on management capacity. Additionally, hiring has a cost (both the search process and the onboarding), and it is likely that as labs get closer to creating AI researchers, they'll believe that the opportunity cost of hiring continues to increase.

Skills useful in capabilities are useful for alignment, and if you’re careful about what job you take there isn’t much more skill penalty in transferring them than, say, switching from vision model research to language model research.

Nah -- I found that very little of my vision model research work (during my undergrad) contributed to my skills and intuitions for language model research work (also during my undergrad; both around 2021-2022). I mean, specific skills like programming, using PyTorch, debugging model issues, data processing, and containerization -- sure, but the opportunity cost is ridiculous when you could be working with LLMs directly and reading papers relevant to the game you want to play. High-quality cognitive work is extremely valuable, and spending it on irrelevant things like the specifics of diffusion models (for example) seems quite wasteful unless you really think this stuff is relevant.

Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.

Yeah, this makes sense for extreme newcomers. If someone can get a capabilities job, however, I think they are doing themselves a disservice by playing the easier game of capabilities work. Yes, capabilities work has better feedback loops than alignment research / implementation work, but that's like saying "search for your keys under the streetlight because that's where you can see the ground most clearly." I'd want these people to start building the epistemological skills to thrive even with a lower intensity of feedback loops, such that they can do alignment research work effectively.

And the best way to do that is to actually attempt to do alignment research, if you are in a position to do so.

It seems like a significant amount of decision theory progress happened between 2006 and 2010, and since then progress has stalled.

You are leaving out a ridiculous amount of context, but yes: if you are okay with leather footwear, Meermin makes great shoes and boots at relatively inexpensive prices.

I still recommend thrift shopping instead. I spent 250 EUR on a pair of new boots from Meermin, and 50 EUR on a pair of thrifted boots which seem about 80% as aesthetically pleasing as the first pair (and just as comfortable, since I tried them on before buying them).

It has been six months since I wrote this, and I want to note an update: I now grok what Valentine is trying to say and what he is pointing at in Here's the Exit and We're already in AI takeoff. That is, I have a detailed enough model of Valentine's model of the things he talks about, such that I understand the things he is saying.

I still don't feel like I understand Kensho. I get the pattern of the epistemic puzzle he is demonstrating, but I don't know if I get the object-level thing he points at. Based on a reread of the comments, maybe what Valentine means by Looking is essentially gnosis, as opposed to doxa. An understanding grounded in your experience rather than an ungrounded one you absorbed from someone else's claims. See this comment by someone else who is not Valentine in that post:

The fundamental issue is that we are communicating in language, the medium of ideas, so it is easy to get stuck in ideas. The only way to get someone to start looking, insofar as that is possible, is to point at things using words, and to get them to do things. This is why I tell you to do things like wave your arms about or attack someone with your personal bubble or try to initiate the action of touching a hot stove element.

Alternatively, Valentine describes the process of Looking as "direct embodied perception prior to thought":

Most of that isn’t grounded in reality, but that fact is easy to miss because the thinker isn’t distinguishing between thoughts and reality.

Looking is just the skill of looking at reality prior to thought. It’s really not complicated. It’s just very, very easy to misunderstand if you fixate on mentally understanding it instead of doing it. Which sadly seems to be the default response to the idea of Looking.

I am unsure if this differs from mundane metacognitive skills like "notice the inchoate cognitions that arise in your mind-body, that aren't necessarily verbal". I assume that Valentine is pointing at a certain class of cognition, one that is essentially entirely free of interpretation. Or perhaps before 'value-ness' is attached to an experience -- such as "this experience is good because <elaborate strategic chain>" or "this experience is bad because it hurts!"

I understand how a better metacognitive skillset would lead to the benefits Valentine mentioned, but I don't think it requires you to only stay at the level of "direct embodied perception prior to thought".

As for kensho, it seems to be a term for some skill that leads you to be able to do what romeostevensit calls 'fully generalized un-goodharting':

I may have a better answer for the concrete thing that it allows you to do: it’s fully generalizing the move of un-goodharting. Buddhism seems to be about doing this for happiness/​inverse-suffering, though in principle you could pick a different navigational target (maybe).

Concretely, this should show up as being able to decondition induced reward loops and thus not be caught up in any negative compulsive behaviors.

I think that "fully generalized un-goodharting" is a pretty vague phrase and I could probably come up with a better one, but it is an acceptable pointer term for now. So I assume it is something like 'anti-myopia'? Hard to know at this point. I'd need more experience and experimentation and thought to get a better idea of this.

I believe that Here's the Exit, We're already in AI takeoff, and Slack matters more than any outcome were all pointing at the same cluster of skills and thought -- about realizing the existence of psyops, systematic vulnerabilities, or issues that lead you (whatever 'you' means) to forget the 'bigger picture', and that the resulting myopia causes significantly bad outcomes from the perspective of the 'whole' individual/society/whatever.

In general, Lexicogenesis seems like a really important sub-skill for deconfusion.

I've experimented with Claude Opus for simple Ada autoformalization test cases (specifically quicksort), and it seems like the sort of issues that make LLM agents infeasible (hallucination-based drift, subtle drift caused by sticking to certain implicit assumptions you made before) are also the issues that make Opus hard to use for autoformalization attempts.

I haven't experimented with a scaffolded LLM agent for autoformalization, but I expect it won't go very well either, primarily because scaffolding involves attempts to turn human-like implicit high-level cognitive strategies into explicit algorithms or heuristics (such as tree-of-thought prompting), and I expect that this doesn't scale given the complexity of the domain (sufficiently general autoformalizing AI systems can be modelled as effectively consequentialist, which makes them dangerous). I don't expect a scaffolded LLM agent (built over Opus) to succeed at autoformalizing quicksort right now either, mostly because I believe RLHF tuning has systematically optimized Opus to write the bottom line first, then attempt to build or hallucinate a viable answer, and then post-hoc justify it. (While steganographic, non-visible chain-of-thought may have gone into figuring out the bottom line, this is still worse than doing visible chain-of-thought first, which gives the model more token-compute iterations to compute its answer.)
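To be concrete about what I mean by "scaffolded" here: something like the generate-check-retry loop sketched below. This is only a rough sketch under my own assumptions -- ask_llm is a hypothetical placeholder rather than any particular API, and the checker is just the Lean CLI run on a temp file (GNAT for Ada would slot in the same way).

```python
# Rough sketch of a generate-check-retry autoformalization scaffold.
# `ask_llm` is a hypothetical stand-in for whatever LLM API you use.
import subprocess
import tempfile


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; should return candidate Lean source code."""
    raise NotImplementedError


def check_candidate(source: str) -> tuple[bool, str]:
    """Write the candidate to a temp file and ask the Lean compiler to elaborate it."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def autoformalize(spec: str, max_rounds: int = 5) -> str | None:
    """Loop: generate a candidate, check it, feed the errors back, retry."""
    prompt = f"Formalize the following algorithm and its correctness statement in Lean 4:\n{spec}"
    for _ in range(max_rounds):
        candidate = ask_llm(prompt)
        ok, log = check_candidate(candidate)
        if ok:
            return candidate
        prompt += f"\n\nYour previous attempt failed to check:\n{log}\nPlease fix it."
    return None  # gave up -- in my experience this is where the drift shows up
```

Note that merely passing the checker is a weak success criterion -- as far as I know, a candidate that elides its proofs with sorry still elaborates with only a warning -- so a loop like this would need a stricter check than exit codes alone.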

If anyone reading this is able to build a scaffolded agent that autoformalizes (using Lean or Ada) algorithms of complexity equivalent to quicksort reliably (such that more than 5 out of 10 of its attempts succeed) within the next month of me writing this comment, then I'd like to pay you 1000 EUR to see your code and for an hour of your time to talk with you about this. That's a little less than twice my current usual monthly expenses, for context.
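For concreteness, here is a minimal Lean 4 sketch of the kind of artifact I would count as one successful attempt: an executable definition plus a specification. The exact spec and names are my own illustration -- a real submission would need a total (non-partial) definition with a termination proof, and actual proofs in place of the sorrys.

```lean
-- Sketch of the target artifact: an executable quicksort plus its spec.
-- `partial` sidesteps the termination proof and `sorry` elides the proofs;
-- a real attempt would have to supply both.
partial def quicksort : List Nat → List Nat
  | [] => []
  | x :: xs =>
      quicksort (xs.filter (fun a => decide (a < x)))
        ++ [x]
        ++ quicksort (xs.filter (fun a => decide (x ≤ a)))

-- The output is sorted ...
theorem quicksort_sorted (l : List Nat) :
    (quicksort l).Pairwise (· ≤ ·) := sorry

-- ... and contains each element with the same multiplicity as the input.
theorem quicksort_count (l : List Nat) (a : Nat) :
    (quicksort l).count a = l.count a := sorry
```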
