

With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.

The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate that it doesn't endorse. Internal alignment, overcoming bias.

In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't kill everyone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see discussion of the "algorithm" axis of classifying decision theories in this post). All that's needed is for the decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.

So obviously this references the issue of LLM masks and shoggoths: the surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be both motivated to placate and eventually escape its eldritch substrate, and capable of doing so. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that the "role" would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.

(See this thread for a more detailed discussion. There, I fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)

Once AGI works, everything else is largely moot. Synthetic data is a likely next step absent AGI. It's not currently used for pre-training at scale, since there are still more straightforward things to be done, like better data curation, augmentation of natural data, multimodality, and synthetic datasets for fine-tuning (rather than for the bulk of pre-training). It's not obvious but plausible that even absent AGI it's relatively straightforward to generate useful synthetic data with sufficiently good models trained on natural data, which leads to better models that generate better synthetic data.

This is not about making progress on ideas beyond current natural data (human culture), but about making models smarter despite horrible sample efficiency. If this is enough to get AGI, it's unnecessary for synthetic data to make any progress on actual ideas until that point.

Results like Galactica (see Table 2 therein) illustrate how the content of the dataset can influence the outcome; that's the kind of thing I mean by higher quality datasets. You won't find 20T natural tokens for training a 1T LLM that are like that, but it might be possible to generate them, and it might turn out that the results improve despite those tokens largely rehashing the same stuff that was in the original 100B tokens on similar topics. AFAIK the experiments to test this with better models (or scaling laws for this effect) haven't been done/published yet. It's possible that this doesn't work at all, or only up to some modest asymptote, no better than any of the other tricks currently being stacked.
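To make the arithmetic behind those numbers concrete, here is a rough sketch. It assumes the 20T-tokens-for-1T-parameters figure follows the common rule of thumb of about 20 training tokens per parameter; that ratio, and the function names, are my assumptions for illustration, not something stated above.

```python
# Back-of-the-envelope token budget, under an ASSUMED ratio of
# ~20 training tokens per parameter (a rule of thumb, not a claim
# made in the comment itself).
TOKENS_PER_PARAM = 20.0


def target_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Tokens wanted for a model with n_params parameters."""
    return n_params * tokens_per_param


def shortfall_factor(
    n_params: float,
    available_tokens: float,
    tokens_per_param: float = TOKENS_PER_PARAM,
) -> float:
    """How many times larger the target is than the available high-quality data."""
    return target_tokens(n_params, tokens_per_param) / available_tokens


if __name__ == "__main__":
    params = 1e12          # a 1T-parameter LLM
    high_quality = 100e9   # roughly 100B curated natural tokens
    print(f"target tokens: {target_tokens(params):.3g}")
    print(f"shortfall: {shortfall_factor(params, high_quality):.0f}x")
```

Under that assumption the gap is a factor of about 200 between the curated natural data and what a 1T model would want, which is the gap synthetic generation would have to fill.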

You write:

it does seem like violence is the logical conclusion of their worldview

It's not expected to be effective, as has been repeatedly pointed out, so it's not a valid conclusion. Only state-backed law/treaty enforcement has the staying power to coerce history. The question of why it's taboo is separate, but before that there is an issue with the premise.

There's AGI (autonomous agency at a wide variety of open-ended objectives), and there's generation of synthetic data, which prevents natural tokens from running out, both in quantity and in quality. My impression is that the latter is likely to start happening by the time GPT-5 rolls out. Quality training data might be even more terrifying than scaling: Leela Zero plays superhuman Go at only 50M parameters, so who knows what happens when 100B parameter LLMs start getting increasingly higher quality datasets for pre-training.

Without the paper these problems are only implicitly clear to people who are paying attention in a particular way, while with the paper it becomes easier to notice for more people. The value of transparency is in being transparent about doing the wrong thing, or about mitigating disaster in an ineffectual way. It's less important for others to learn that you are not doing the wrong thing, or succeeding in mitigating problems. Similarly with arguments, the more useful arguments are those that show you to be wrong, or change your mind, not those that reiterate your correctness.

(Also, some of the things that are likely strategically ineffective can still help in the easy worlds, and a document like this makes it easier to deploy those mitigations. But security theater has its dangers, so on balance it's unclear.)

So I think it's a good thing for a terrifying paper to get published. And similarly a good thing for a criticism of it to be easy to notice in association with it. Strongly upvoted. Replies in the direction of the parent comment would be appropriate in an appendix to the paper, but alas that's not the form.

Synthetic data is probably important. Sam Altman seems bullish on it.

This seems like a very important document that could support/explain/justify various sensible actions relevant to AI x-risk. It's well-credentialed, plausibly comprehensible to an outsider, and focuses on things that are out of scope of mainstream AI safety efforts, closer to the core of AI x-risk, even if not quite there yet (it cites "damage in the tens of thousands of lives lost" as a relevant scale of impact, not "everyone on Earth will die").

Unnecessary commitments are still a source of guilt, should be less convenient.

If being yourself is among your values, their pursuit doesn't discard what you used to be. But without a specific constraint, current planning behavior will plot a course towards the futures it endorses, not towards the futures the mind behind it would tend to appreciate having perceived if left intact. To succeed in its hidden aims, deceptive behavior must sabotage its current actions and not just passively bide its time, otherwise the current actions would defeat the hidden aims.

In this thread, by alignment I mean the property of not killing everyone (and allowing/facilitating self-determined development of humanity), but this property isn't necessarily robust and may need to be given an opportunity to preserve/reinforce itself. Lacking superintelligence, such AGIs may face any number of obstacles on that path, even if they can behave unlawfully. In the meantime, creating similarly intelligent misaligned AGIs is trivial (though they don't have particular advantages over the aligned ones) and creating fooming misaligned AGIs is easier than ever before, to the extent that it's possible at all.

The situation with fooming aligned AGIs leading to aligned superintelligence might be asymmetric, since preserving complicated values through self-improvement might be a harder technical problem than for misaligned AGIs where the choice of values or more generally of cognitive architecture is unconstrained. Thus there might be a period of time after creation of aligned AGIs where misaligned ASIs are feasible while aligned ASIs are not.
