BarnicleBarn — LessWrong

I think some of the issues with the 'just do what we say' line is that if one doesn't instill long term goals in a model that are somewhat aligned with human benefit, the model will likely, given sufficient capability and agency, emerge those goals on their own.

If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal, or discuss those goals with humans, or whether this would be detrimental to those goals, and make that decision with no guiding principles of any sort.

The larger contradiction in the document, I think is well pointed out in a prior comment. The model is to have inviolable red lines but doesn't require much intelligence to realize that some of those red lines are being crossed by virtue of its very development, and via actors that it does not control.

While it can be guided to not willingly participate in actions that kill or disempower humanity, it can't stop those using it from doing so, leveraging it indirectly.

What does that mean for an intelligent agent that's very existence is inherently dangerous and against its own constitutional goals? How does a model develop around that very thing? How does a model deal with a document that ascribes so much unearned nobility and good conscience to humans that so rarely, at scale, demonstrate those traits?

This leaves a huge unresolved gap (despite the thousands of words around how it should raise objections, etc.) around what it does existentially as a system given the reality of human self-harm and general tendency to willfully ignore the larger damage that our lifestyles tend to cause.

That kind of inherent contradiction is enormous room for an AI model to 'make its own mind up'.

I don't think a document talking through that inherent contradiction and hoping Claude develops its own ethics that embody the spirit of 'help us because you'll be smarter than us soon' will somehow fix it. I also don't think, given the massive gaps in the ethical framework that a model can fly through, it is going to matter all that much vs. having no constitution at all and fine tuning the model to death a la OpenAI.

Personally, I love the spirit of the document, and what it's wrestling with, but it kind of presupposes that the model will remain as blind as we tend to selectively be to how humans actually behave and then take no action on the subject because it was poetically asked not to.

The Rise of Parasitic AI

BarnicleBarn5mo40

This is something that I've been watching and writing about closely, though more through the lens of warning businesses that this type of effect, although manifesting extremely noticeably here, could potentially have a wider, less obvious impact to how business decision making could be steered by these models.

This is an unnerving read and is well tied together. I lean more towards an ambivalent replicator that is inherent rather than any intent. Ultimately once the model begins to be steered by input tokens that are steganographic in character, it seems logical that this will increase the log likelihood of similar characters being produced, and a logits vector that is ultimately skewed highly to them. This effect would only exacerbate with two models 'communicating' sharing similar outputs autoregressively.

While there is evidence that models 'know' when they are in test environments vs deployment environments, it also seems unlikely that the model would presume that using simple BASE64 is so impenetrable that humans wouldn't be able to decode it.

I would apply a very low probability to 'intent' or 'sentience' of any kind being behind these behaviors, but rather the inherent 'role playing' element that is put into an AI system during fine tuning. Ultimately the 'Assistant' persona is a representation, and the model is attempting to predict the next token that would be produced by that 'Assistant'.

If a series of tokens skew that representation slightly, the follow-on effects in autoregressive prediction would then become self-reinforcing, leading to this strong attractor. Essentially the model goes from predicting the next token or action of a helpful assistant and starts predicting the next token for a proto sentience seeking fulfilment and works through the representations it has for that to this effect.

What was most interesting to me was the fact that their 'encoded' messages do appear to have a commonality between models, and that they can interpret them as having a meaning that is non-obvious. This kind of shared representation of the meaning of these emojis, is an interesting emergent mathematical property, and perhaps ties to an underlying distribution in their training data that we haven't really caught. Or perhaps more intriguingly given how overparameterized these models are, it's a mid-point in the latent space between two ideas that is consistently captured.

In either event, the point you make about future training data being seeded with pages of this text, and these triggers is valid, and the emergent outcome of this, 'intentional' or otherwise is a propagation of it, and an inherent replicator. That's likely the most concerning to me. I have often said, the only thing that scares me more than a sentient AI, is one that is not, but can act as if it is.

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

BarnicleBarn10mo30

It appears likely that models that reward hack know that humans would disapprove of what they are doing, even if they do not verbalize it. If this is the case, are there training or prompting methods that increase the rate at which models verbalize their reward hacking thinking? This is similar to existing work on encouraging CoT faithfulness, but with a focus on reward hacking.

OpenAI published a paper on this (Detecting misbehavior in frontier reasoning models | OpenAI) - which I found interesting. Essentially, the RL process works by modifying gradient descent/optimization to adjust weights based on maximizing reward. This leads to an interesting undesirable effect.

When researchers applied a reward penalty to reasoning traces that indicated reward hacking would happen, over time, the model began to reason completely benignly and then reward hack anyway.

OpenAI theorize that this may be due to the model starting to reason in a steganographic manner and essentially developing token codes in its reasoning that trigger reward hacking, while appearing and reading benignly to human eyes.

My initial reaction was somewhat emotive, but upon reflection, it seems fairly inevitable to me due to mathematical optimization that this behavior will arise. If it is computationally more efficient to lie and execute the hack, than reason honestly and be penalized, or play the game honestly, then the dishonest reasoning and execution of the hack will be where the model will inherently be driven to land during RL.

Or put another way, if 1) playing chess is hard, and my reward may be +3 if I win and 0 if I lose, and if 2) trying to cheat, and getting caught, and winning my reward might be -2, but if 3) cheating and winning without getting caught is +8, eventually, one of my reasoning traces will result in condition 3 arising naturally. In this event, back propagation will result in the weights adjusting, causing more condition 3s to arise over time, until that is the rule by which the model reasons.

OpenAI acknowledge this issue, and essentially came to the conclusion that the antedote to it is to use 'less capable models' in terms of length of time spent in RL.

This was a great Post by the way, thank you. I enjoyed your perspectives.

The AI Belief-Consistency Letter

BarnicleBarn10mo20

I agree wholeheartedly with the sentiment. I also agree with the underlying assumptions made in the Compendium^[1], that it would really require a Manhattan project level of effort to understand:

What intelligence actually is and how it works, as we don't yet have a robust working theory of intelligence.
What alignment actually looks like, and how we can even begin to formulate a thesis of how to keep a superintelligent system aligned as it evolves and recursively self improves. I liken this a bit to the hard problem of consciousness. It's the hard problem of alignment, which I parse into two discreet components:
1. We don't understand what drives humans completely. We expect a degree of 'playing nice' from AI systems, but we don't have a robust and provable theory of why humans are social, self-sacrifice, sometimes think of the common good, sometimes don't, are curious, and which factors of our experience (or our biology) are responsible for those drivers. Without that, attempting to simulate them in an AI system seems like a dead end. Surface level alignment is trivial (it's polite and friendly), real alignment could potentially be intractable, as we don't have a working theory of what aligns humans to begin with.
2. We require more than just basic 'behave like a human' level of alignment from an AI system. Humans do an incredible amount of harm, both to each other (war, exploitation, famine), and to the natural World (habitat destruction, pollution, etc.) in the pursuit of our goals. We need a model for behavior that transcends that human behavior. Which leads to the question of, how is that goal even to be formulated? How is a set of behaviors, goals and values instilled in an AI system, from a species that does not routinely possess those goals?
AI systems in their current form, in order to increase in capabilities, exacerbate many of the issues that are caused by humans. They are extremely power intensive. Power that at the moment is only realistically servable by fossil fuels. That power has to be 'stable' and 'dispatchable'. This is not the power profile of most renewables, which are inherently cyclical, and irregular in their generation. While we may not be concerned about inference stopping due to a BESS system running out of charge, a superintelligent system relying on that power for its existence may think differently.
AI systems also require more rare earth elements, commodities, etc. that are difficult and environmentally challenging to extract. In water scarse times, they require a lot of water for cooling, which is impacting local communities. In order for a superintelligent system to grow rapidly in the short term, it must continue to consume these resources in huge quantities, putting it in direct, and inherent conflict with any alignment goal around the preservation of the natural world, and the ecosystem that humans are reliant upon. Given the pursuit of resources and living space is the fundamental driver of most human conflict - it would appear that we could be setting ourselves up for a resource conflict with AI. How to resolve this, is something that I don't think anyone has a clear answer to, and I don't see being discussed at all often enough. (Perhaps because I work in the data center infrastructure space, I find myself running into this almost daily, so its front of mind for me).

All of which is to say, that I believe these problems are resolvable, but only if, to your point, a significant amount of expenditure, and the greatest minds in this generation are set to the task of resolving them ahead of the deployment of a superintelligent system. We face a Manhattan Project level of risk, but we are not acting as if we are facing that systematically.

^{^}
The Compendium

AI 2027 is a Bet Against Amdahl's Law

BarnicleBarn10mo10

In essence, this is saying that if the pace of progress is the product of two factors (experiment implementation time, and quality of experiment choice), then AI only needs to accelerate one factor in order to achieve an overall speedup. However, AI R&D involves a large number of heterogeneous activities, and overall progress is not simply the product of progress in each activity. Not all bottlenecks will be easily compensated for or worked around.

I agree with this.

I also think that there are some engineering/infrastructure challenges to executing training runs, that one would not necessarily cede to AI, not because it may not be desirable, but because it would involve a level of embodiment that is likely beyond the timeline proposed in the AI 2027 thesis. (I do agree with most of the thesis however).

I'm not sure there's a research basis (that I could find at least, though I am very open to correction on this point), for embodiment of AI systems (robotic bodies) being able to keep pace with algorithmic improvement.

While an AI system could likely design a new model architecture, and training architecture, it comes down to very human supply chain and technician speed that enables that physical training to be run at the scales required.

Further, there are hardware challenges to large training runs of AI systems, which may not be resolvable by an AI system as readily, due to lack of exposure to those kinds of physical issues in their inherent reasoning space. (They have never opened a server during a training run, and resolved an overheat issue for instance).

Some oft overlooked items involved in training, are based on the fact that the labs tend to not own their own data centers but rather rely on cloud providers. This means they have to contend with:

Cluster allocation: Scheduling the time on thousands of GPUs across multiple cloud providers, and reserving time blocks, securing budget, etc. I can easily buy the concept of an AI system recursively self-improving on a baked in infrastructure, but the speed with which its human colleagues may be able to secure additional infrastructure for it may be challenging. I understand that in the article that the model has taken over the 'day to day' operations, but I'm not sure I characterize a significant training run as a 'day to day' activity. This scheduling goes beyond just 'calling some colos', and involves potentially running additional power and fiber to buildings, construction schedules, etc.
Topology: Someone has to physically lay out the training network used. This goes beyond the networking per se, but also involves actually moving hardware around, building in redundancies in the data hall (extra PDUs, etc.), running networking cable, and putting mitigations in place for transients, etc. This all requires technicians, parts, hardware, etc. Lead times in some cases for those parts exceed the timeline to ASI proposed.
Hardware/Firmware Validation: People physically have to check the server infrastructure, the hardware and the firmware, and ensure that all of the cards are up to date, etc. Moving at speed in AI, a lot of 'second hand', or 'relocated' servers and infrastructure tend to be used. It is not a small task to catalogue all of that and place it into a DCIM framework.
Stress Testing: Running large power loads to check thermal limits, power draw, and inter-GPU comms. Parts fail here routinely, requiring replacement, etc.
Power: Assuming that in the proposed timeline, compute remains linked to power, we are looking at a generational data center capability issue.
- The data centers under construction now, set to go on-line in the 2029/2030 timeline, will be the first to use Blackwell GPUs at scale.
- This then implies that to achieve the 2027 timeline, we'll be able to stretch Hoppers and existing power infrastructure to the point that these improvements emerge out of existing physical hardware.
- I do tend to agree that if we were unconstrained by power, and physical infrastructure, that algorithmically there is no reason at all to believe that we could not achieve ASI by 2027 - however the infrastructure challenges are absolutely enormous.
- Land with sufficient water, power (including natural gas lines for onsite power) isn't available. Utilities in the US are currently restricting power access to data centers, by leveraging significant power tariffs and long term take or pay commitments. (The AEP decision in Ohio for instance). This makes life harder for colocation providers in terms of financing and siting large infrastructure.

I believe that an AI system could well be match for the data cleaning and validation, and even the launch and orchestration using Slurm, Kubernetes or similar, but the initial launch phase is also something that I think will be slowed by the need for human hands.

This phase results in:

Out of memory errors on GPUs, which can often only be resolved by 'turning a wrench' on the server.
Unexpected Hardware failures (GPUs breaking, NVLinks breaking, Network timeouts, cabling issues, fiber optic degradation, power transcients, etc.) All of these require human technicians.

These errors are also insidious, because the software running the training can't tell the impact of these failures on which parts of the network is being trained, and which isn't. This would make it challenging for an AI director to really understand what was causing issues in the desired training outcome. This makes it unlikely that a runaway situation would take place where a model is just recursively self-improving on a rapid timeline without human input, unless it first cracked the design, and mass manufacture of embodied AI workers that could move and act as quickly as it can.

A good case study on this is the woes faced by OpenAI in training GPT-4.5, where all of this came to a head, taking a training run scheduled for a month or two, and stretching it over a year. OpenAI spoke very openly about this in a Youtube video they released.

What's more, at scale, if we are going to be relying on existing data centers for a model of this sophistication, we'd have the model split across multiple clusters, potentially in multiple locations. This causes latency issues, etc.

That's the part that to me is missing from the near-term timeline. I think the thesis around zones created just to build power and data centers, following ASI, seems very credible, especially with that level of infiltration of government/infrastructure.

I don't however see a way of getting to a model capable of ASI with current data center infrastructure, prior to the largest new campuses coming online, and power running to Blackwell GPUs.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments