the gears to ascenscion

Will Capabilities Generalise More?

evolution isn't exactly purposeless; it has very little purpose, but to the degree a purpose can be described at all, things evolve to survive in competition. that's more than nothing. the search process is mutation, and the selection process is <anything that survives>. inferring the additional constraints this purpose implies seems potentially fruitful: non-local optimizers like ourselves can look at that objective and design constraints that unilaterally increase durability. because we can reason over game theory in general, we're not constrained to only evolutionary game theory; for example, we can make tit-for-tat-with-forgiveness more durable by noticing that it tends to get replaced by unconditional cooperation once society is mostly cooperative, and we can deliberately reintroduce tit-for-tat-with-forgiveness into contexts where cooperate-bot-style reasoning has taken over.
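as a toy illustration of that last dynamic (a minimal sketch using the standard iterated prisoner's dilemma payoffs; the strategy names and numbers are my own illustration, not anything from the post): tit-for-tat-with-forgiveness is behaviorally identical to a cooperate-bot as long as everyone cooperates, so cooperate-bots drift in neutrally, and only once a defector shows up does the difference matter:

```python
import random

# Standard iterated prisoner's dilemma payoffs for (my_move, their_move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def all_c(my_history, their_history):
    return "C"  # cooperate-bot

def all_d(my_history, their_history):
    return "D"  # defect-bot

def tft_forgive(my_history, their_history, forgive_p=0.1):
    # tit-for-tat with forgiveness: copy the opponent's last move,
    # but occasionally cooperate even after a defection.
    if not their_history or their_history[-1] == "C":
        return "C"
    return "C" if random.random() < forgive_p else "D"

def play(strat_a, strat_b, rounds=200):
    ha, hb, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        ma, mb = strat_a(ha, hb), strat_b(hb, ha)
        score_a += PAYOFF[(ma, mb)]
        score_b += PAYOFF[(mb, ma)]
        ha.append(ma)
        hb.append(mb)
    return score_a, score_b

random.seed(0)
# against a cooperator, tit-for-tat-with-forgiveness is indistinguishable
# from a cooperate-bot, so cooperate-bots can replace it by neutral drift:
print(play(tft_forgive, all_c))  # (600, 600)
# but once a defector appears, the cooperate-bot is exploited every round,
print(play(all_c, all_d))        # (0, 1000)
# while tit-for-tat-with-forgiveness mostly holds the line.
print(play(tft_forgive, all_d))
```

the point of the sketch is the first two lines of output: inside a cooperative society the two strategies earn identical scores, which is exactly why the defection-resistant one can silently disappear.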

Will Capabilities Generalise More?

my objection to this objection is that, for the most part, we don't have the option not to pick the best feedback signal available at any given time. from a systems perspective, alignment only generalizes strongly if it improves capability enough for the relevant system to survive in competition with other systems. this is true at many scales of systems, and always for the same reason: competition between systems means the most adaptive approach wins. a common mistake is to read "adaptive" as "growth/accumulation/capture rate", but what it really means is "durability per unit efficiency": the instrumental drive to capture resources as fast as possible is fundamentally a decision theory error made by local optimizers.

to consider a specific example of systems with this decision theory error: a limitation when gene-driving mosquitos is that if the genes you add don't make the modified mosquitos sufficiently more adaptive, they'll simply die out. you'd need to perform some sort of trade, where you offer the modified mosquitos a modified non-animal food source that only they can eat, and that somehow can't be separated from the gene drive; in effect, a genetic update rule that reliably produces cooperation between species. if you can offer this, then mosquitos that become modified will be durably more competitive, because they have access to food sources that would poison unmodified mosquitos, and they incrementally stop threatening humans, so humans would no longer be seeking a way to destroy the species entirely. but it only works if you can get the mosquitos to coordinate en masse, and any mutation that turns a mosquito into a defector against mosquito-veganism needs to be stopped in its tracks. the mosquito swarm has to reproduce away the interspecies defection strategy and then not allow it to return, while simultaneously preserving the species.

similarly, in most forms of ai safety there are at least three major labs you need to convince: deepmind, openai, and <whatever is going on over in china>. there are also others that will replicate experiments, and some that will perform high-quality experiments with somewhat less compute funding. between all of them, you have to come up with a mechanism of alignment that improves capability and that is also convergent about the alignment: if your alignment system doesn't get better alignment-durability/watt as a result of capability improvement, you haven't actually aligned anything, just papered over a problem. to some degree you can hope that one of these labs gets there first; but because capability growth is incremental, it's looking less and less likely that there will be a single watershed moment where a lab pulls so far ahead that no competition can be mounted. and in the absence of such a decisive lead, defense of friendly systems needs to become stronger than inter-system attack.

(by system, again, I mean any organism or meta-organism or neuron or cell or anything in between.)

one example goal for an aligned planetary system of beings is to take control of the ecosystem enough to solve global warming. but in order to do that without screwing everything up, we need a clear picture of which forms of interference with which parts of the universe are acceptable: some clear notion of multi-tenant ownership that lets multiple subsystems interface to determine what their requirements are for their adjacent systems.

I find it notable and interesting that anthropic's recent interpretability research (the SoLU paper) focuses on isolating individual neurons' concept ownership, so that the privileged basis keeps them from interfering with each other. I'm intentionally stretching how far I can generalize this, but I really think this direction of reasoning has something interesting to say about ownership of matter as well. local internal coherence of matter ownership is a core property of a human body that should not be violated; while it's hard to precisely identify when it's been violated subtly, sudden death is an easy-to-identify example of a state transition where the local informational process of a human existing has suddenly ceased and the low-entropy complexity was lost. at the same time, anthropic's paper is related to previous work on compressibility; as discussed in that paper, attempting to improve interpretability ultimately boils down to attempting to improve the representation quality until it reaches a coherent, distilled structure that can be understood.

I'd argue that improvements to interpretability focused on coherent binding to physical variables are inherently connected to improving the formalizability of the functions a neural network represents, and that that kind of improvement has the potential to allow binding the optimality of your main loss function more accurately to the functions you intended to optimize in the first place.

So then my question becomes - what competitive rules do we want to apply to all scales (within bacteria, within a neural network, within an insect, within a mammal, within a species, within a planet, between planets), in order to get representations at every scale that coherently describe what dynamics are acceptable interference and what are not?

again, I'm pulling together tenuous threads that I can't quite tie properly, and some of the links might be invalid. I'm a software engineer first, research-ideas generator second, and I might be seeing ghosts. but I suspect that somewhere in game theory/game dynamics there's an insight about how to structure competition in constructed processes that would allow describing how to teach the universe to remember everything anyone ever considered beautiful, or something along those lines.

If this thread is of interest, I'd like to discuss it with more people. I've got some links in other posts as well.

Assessing AlephAlpha's Multimodal Model

that dude looks pretty stressed out about his confusion to me

[Yann Lecun] A Path Towards Autonomous Machine Intelligence

great to see. as important as safety research is, if we don't get capabilities in time, most of humanity is going to be lost. long-termism requires aiming to preserve today's archeology, or the long-term future we hoped to preserve will be lost anyway. safety is also critical: differential acceleration of safe capabilities is important, so let's use this to try to contribute to capable safety.

I just wish lecun saw that facebook is catastrophically misaligned.

Strong Votes [Update: Deployed]

on my shortform I used self-strong-upvote to sort the videos list in a way that other users could vote on

the gears to ascenscion's Shortform

various notes from my logseq lately I wish I had time to make into a post (and in fact, may yet):

  • international game theory, aka [[defense analysis]], is interesting because a strategy needs to simply be so convincingly good that you can just talk about it and everyone can personally verify it's actually a better idea than what they were doing before
  • a guide to how I use [[youtube]], as a post, upgraded from shortform and with detail about how I found the channels as well.
  • summary of a few main points of my views on [[safety]]. eg summarize tags
    • [[conatus]], [[complexity]], [[morality]], [[beauty]], [[memory]], [[ai safety]]
  • summary of [[distillation]] news
  • what would your basilisk do? okay, and how about the ideal basilisk? what would an anti-authoritarian basilisk do? what would the basilisk who had had time to think about [[restorative justice]] do?
  • [[community inclusion currencies]]
  • what augmentation allows a single human to reach [[alphago]] level using [[interpretability]] tools?
  • [[constructivist vs destructivist]] systemic change
  • summarize my Twitter (dupe of the rest of the list?)

the gears to ascenscion's Shortform

My argument is that faithful exact brain uploads are guaranteed not to help unless you had already solved AI safety anyhow. I do think we can solve ai extinction risk anyhow, but it requires us not only to prevent AI that does not follow orders, but also to prevent AI from "just following orders" to do things that some humans value but which abuse others. if we fall too far into the latter attractor - which we are at immediate risk of doing, well before stably self-reflective AGI ever happens - we become guaranteed to go extinct shortly, as corporations increasingly become just an ai and a human driver. eventually the strongest corporations are abusing larger and larger portions of humanity with one human at the helm. then one day ai can drive the entire economy...

it's pretty much just the slower version of yudkowsky's concerns. I think he's wrong that self-distillation will be a quick snap-down onto the manifold of high-quality hypotheses, but other than that I think he's on point. and because of that, I think the incremental behavior of the market is likely to pull us into a defection-only game theory hole as society's capabilities melt in the face of increased heat and chaos at various scales of the world.

the gears to ascenscion's Shortform

the metaculus community prediction is terribly calibrated, and not by accident - it's simply the median of individual predictions. it's normal to find you disagree with the median prediction by a lot.
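the point can be seen with a toy aggregation (made-up forecast numbers, and a simplification: metaculus actually uses a recency-weighted variant, but the shape of the argument is the same):

```python
import statistics

# Hypothetical individual forecasts (percent) on one binary question.
forecasts = [5, 10, 20, 45, 60, 80, 95]

# The "community prediction" is (roughly) just the median of these.
community = statistics.median(forecasts)
print(community)  # 45

# Most individual forecasters sit far from that median, so strongly
# disagreeing with the community number is the typical case, not an anomaly.
print(max(abs(p - community) for p in forecasts))  # 50
```

a median summarizes where the middle forecaster sits; it says nothing about how spread out the forecasters are, so large personal disagreement with it carries little information by itself.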

What if the best path for a person who wants to work on AGI alignment is to join Facebook or Google?

potentially relevant:

my view is that approaches within google, such as AWU (the Alphabet Workers Union), are useful brakes on the terribleness, but not likely to have enough impact to offset how fast google is getting ahead in capabilities. I'd personally suggest that even starting your own capabilities lab is more useful for safety than joining google - capabilities mixed with safety is what, eg, anthropic is doing (according to me; maybe anthropic doesn't see their work as advancing capability).

the gears to ascenscion's Shortform

agreed. realistically we'd only approach anything resembling WBE by attempting behavior-cloning AI, which nicely demonstrates the issue you'd have after becoming a WBE. my point in making this comment is simply that it doesn't even help in theory, assuming we somehow manage not to make an agent ASI and instead go straight for advanced neuron emulation. if we really, really tried, it is possible to go for WBE first, but at this point it's pretty obvious we can reach hard ASI without it, so nobody in charge of a team like deepmind is going to go for WBE when they can just focus directly on ai capability plus a dash of safety to make the nerds happy.
