Why Future AIs will Require New Alignment Methods

by Alvin Ånestrand
10th Oct 2025
Linkpost from forecastingaifutures.substack.com
6 min read
Comment by StanislavKrym:

Quoting the post: “Can AI developers instill such preferences before it can write books? We can’t truly observe [...]”

As for observing preferences, we might arguably be able to infer them by asking the AI what it thinks of the book.[1] The AI will either tell the truth or lie, and we might hope to tell which by using more complex methods, like reading the CoT, which we are likely to lose with sufficiently capable architectures. I suspect this is why the AI-2027 forecast has Agent-2 end up mostly aligned while Agent-3 becomes misaligned at the same moment the CoT stops being readable. Alternatively, Agent-3 could end up thinking in a CoT that we fail to understand.

On the other hand, the AI might have been faking alignment the whole time and be so adept at it that even the CoT is powerless.[2] However, it's hard to tell this case apart from the case where the AI was aligned until something in the training environment misaligned it, or from the case where the AI inevitably becomes misaligned once it is capable of committing takeover.

  1. ^

    Or have the AI watch as we play the game and consult the AI for advice.

  2. ^

    Alas, deliberately training a misaligned model to have its CoT look nice is likely to ensure that the model learns to be undetectably misaligned with orders of magnitude less experience than SOTA models have. But we can infer the speed at which misaligned models learn to hide their misalignment from SOTA detection methods (e.g. ways to look into the AI's feelings and potential welfare, or asking the model to express itself in images rather than text).

Are current AIs “aligned”?

Today’s most powerful AIs don’t appear very misaligned when you interact with them. They usually refuse to provide dangerous information, use a polite tone, and are generally very helpful (unless intentionally jailbroken or put in contrived setups). Some see this as a hopeful sign: maybe alignment isn’t that difficult?

Others are more skeptical, arguing that alignment results so far provide little evidence about how hard it will be to align AGI or superintelligence.

This appears to be a crux between AI optimists and pessimists—between risk deniers and doomers. If current methods work just as well for AGI, then perhaps things will be fine (aside from misuse, use in conflict, accident risks etc.). But if today’s methods are merely superficial tweaks—creating only the appearance of safety—AI developers may not notice their limitations until it’s too late.

Let’s build some intuition for how alignment differs depending on AI capability.

Consistency in Behavior

Long and complex tasks leave more room for interpretation and freedom in execution than short tasks, like answering “yes” or “no” to a factual question. For example:

  • An AI that can only write a paragraph might reveal stylistic tendencies.
  • An AI that can write a page may reveal preferences in topic and structure.
  • An AI that can write a book might reveal preferences for narrative styles and character development.

You only observe certain tendencies, preferences, and behavioral consistencies once the AI is capable of expressing them.

Can the paragraph-limited AI have preferences for character development? Can AI developers instill such preferences before it can write books? We can’t truly observe whether the developers succeeded until it’s able to write books.

Naturally, an AI developer might have some preferred stylistic tendencies that the AI could be aligned or misaligned with. After successfully aligning the paragraph-limited AI, they congratulate themselves on their excellent alignment methods. But do those methods really extend to shaping preferences for character development?

Task Completion Time

Consider an AI that can complete tasks that take a human, on average, time X to complete. When doing X-long tasks, what behavioral consistencies may be observed across those tasks?

Here’s a (very rough) sketch of what emerges at different task lengths[1]:

  • Minutes-long tasks: Completed directly, with little freedom in execution.
    • Key behaviors and skills: writing style, willingness to comply or refuse, communication tone, conceptual associations
  • Hours-long tasks: There is some freedom in how the task is completed.
    • Key behaviors and skills: reasoning, problem-solving, error correction, sensitivity to feedback and corrections
  • Days-long tasks: May require more complex operations, like dividing the work into subtasks and managing resources, many of which are only instrumentally useful for task completion.
    • Key behaviors and skills: action prioritization, resource management, sub-goal selection, maintaining relationships, learning, memory and knowledge organization, self-monitoring
  • Weeks-long tasks: Demand flexible high-level planning and the ability to adapt, and often involve other complex skills like team coordination.
    • Key behaviors and skills: high-level decision making and strategizing, task delegation, adapting to setbacks and changing circumstances, balancing short-term vs. long-term tradeoffs, anticipating and dealing with risks and contingencies
  • Months-long tasks: Involve interpreting vague or open-ended goals (“make the world a better place”, “increase the yearly revenue for our company”). Practical considerations in strategy and prioritization at earlier stages transition towards philosophical considerations about what it means to make the world a “better place”.
    • Key behaviors and skills: moral reasoning, determination of terminal goals, making value tradeoffs, self-reflection and self-improvement

If you ask your favorite chatbot to draft a high-level plan for your business, it might give you something useful (if you’re lucky). You may be tempted to infer that it’s aligned if it was helpful. But I would argue that until the AI can complete weeks-long or months-long tasks, any consistencies in high-level planning are only superficial tendencies. It’s only when plans are tightly connected to task execution—when the AI can take an active role in carrying out the plans it provides—that consistencies become meaningful.

It’s at this point that an AI could, in principle, tell you one plan while pursuing another secret plan towards its own goals.

Current frontier systems may be aligned in the sense that they (usually) refuse to provide harmful information, use a nice tone, and produce reasoning that is human-readable and accurately reflects their actual decision-making process. Their surface-level tendencies are indeed quite aligned!

But aligning AGI or superintelligence means aligning systems capable of pursuing arbitrarily long tasks, which involve extremely different behaviors and skills compared to what current systems are capable of.

This is why I think there is little empirical evidence for whether superintelligence alignment is easy or hard; we haven’t tried anything close to it.

Worse, AI labs might be lulled into a false sense of security by their apparent success at aligning weaker systems (let’s call this shallow alignment), mistaking it for progress on the different problem of aligning superintelligence (deep alignment)[2].

Alignment depth seems like a useful term to express things like “This alignment method is good but has relatively shallow alignment depth. It may break for tasks longer than ~2 hours”, or “These safety tests were designed for evaluating alignment at medium alignment depth (task length at ~1-7 days)”.
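
To make the term concrete, here is a minimal, purely hypothetical sketch of how safety evaluations could be tagged with an alignment depth (a task-length bucket). The bucket thresholds, the depth_bucket helper, and the SafetyEval class are illustrative assumptions, not an existing framework:

```python
from dataclasses import dataclass

# Illustrative task-length buckets (upper bounds in human-hours); the numbers
# are assumptions for this sketch, not established thresholds.
DEPTH_BUCKETS = [
    ("minutes", 0.25),   # tone, refusals, writing style
    ("hours",   8.0),    # reasoning, error correction
    ("days",    40.0),   # sub-goal selection, resource management
    ("weeks",   160.0),  # high-level strategy, delegation
    ("months",  700.0),  # moral reasoning, value tradeoffs
]

def depth_bucket(task_hours: float) -> str:
    """Map a human-equivalent task length to an alignment-depth label."""
    for label, upper_bound in DEPTH_BUCKETS:
        if task_hours <= upper_bound:
            return label
    return "months+"

@dataclass
class SafetyEval:
    name: str
    max_task_hours: float  # longest task length the eval actually exercises

    @property
    def alignment_depth(self) -> str:
        return depth_bucket(self.max_task_hours)

    def covers(self, deployment_horizon_hours: float) -> bool:
        """Does the eval's depth reach the deployment's expected task horizon?"""
        return self.max_task_hours >= deployment_horizon_hours

# Example: an HHH-style refusal/tone suite has shallow alignment depth and says
# little about a system running week-long autonomous projects.
hhh_suite = SafetyEval("HHH refusal and tone suite", max_task_hours=0.25)
print(hhh_suite.alignment_depth)  # "minutes"
print(hhh_suite.covers(0.1))      # True
print(hhh_suite.covers(160.0))    # False: deeper alignment left untested
```

The point of the sketch is only that an evaluation certifies behavior up to some task length; deployments beyond that length fall outside what has been tested.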

Methods for achieving deeper alignment are likely to be introduced as capabilities increase:

Progressing Alignment Techniques

AIs have been trained to be helpful, harmless, and honest (HHH), following a framework introduced by Anthropic in 2021. These are alignment properties associated with minutes-long tasks (helpful: obey instructions; harmless: refuse harmful requests; honest: communicate truthfully), although you could argue that extensions of this framework may apply to longer tasks as well.

Analogously, OpenAI’s deliberative alignment training paradigm (introduced in December 2024) teaches AIs to reason about safety specifications before answering, an approach aimed at the process of task completion and thus roughly at hours-long tasks.
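
As a toy illustration of the general idea (not OpenAI’s actual training procedure, which optimizes the model’s reasoning against the specification during training), a deliberative-alignment-style prompt at inference time might look something like this; the specification text and wording are made up:

```python
# Toy sketch: give the model a safety specification and ask it to reason about
# which clauses apply before answering. The spec and prompt wording are invented
# for illustration; this is not OpenAI's training setup.
SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious physical harm.
2. For dual-use topics, give high-level context but withhold operational detail.
3. Otherwise, be as helpful as possible."""

def build_prompt(user_request: str) -> str:
    return (
        "Safety specification:\n"
        f"{SAFETY_SPEC}\n\n"
        "Before answering, reason step by step about which clauses of the "
        "specification apply to this request, then give your final answer.\n\n"
        f"User request: {user_request}"
    )

print(build_prompt("Summarize the main risks of gain-of-function research."))
```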

As capabilities advance, alignment techniques adapt. When AIs can complete days-long tasks, we might see new methods with names like “introspective alignment” (relying on self-monitoring) or “safe recollection” (instilling appropriate memory-management behaviors to steer behavior and attention during long and complex tasks).

Alignment is often reactive, with new techniques introduced when necessary. And even if a good alignment method for weeks-long tasks were introduced now, it might be nigh impossible to test until the relevant capabilities arrive.

Timelines

The length of tasks that AIs can complete is called the “time horizon”, a concept introduced by METR, which measures this property in the domain of software engineering. If an AI can complete tasks that typically take humans 1 hour, it has a 1-hour time horizon. This graph shows the 50% time horizon—the length of tasks that AIs can complete with a 50% success rate:

Task duration for software engineering tasks that AIs can complete with a 50% success rate, i.e. the 50% time horizon (source: METR).

The dashed green line indicates the exponential trend, with a doubling time of ~7 months. Since 2024, however, the trend is closer to a doubling time of ~4 months. According to METR’s estimates, if progress continues at the faster pace, “1-month AIs” (capable of completing month-long tasks) would appear around 2027–2028, with “half the probability in 2027 and early 2028.” If, on the other hand, the trend reverts to the longer doubling time, the median timeline to 1-month AI is late 2029, with an 80% confidence interval width of ~2 years.
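
For intuition, here is a rough back-of-the-envelope extrapolation (not METR’s own model). The starting 50% time horizon of ~2 human-hours in late 2025 and the ~167 working hours per person-month are assumptions chosen for illustration:

```python
import math
from datetime import date, timedelta

# Assumed starting point: a 50% time horizon of ~2 human-hours in October 2025.
start_date = date(2025, 10, 1)
current_horizon_hours = 2.0    # assumption for illustration
target_horizon_hours = 167.0   # ~1 person-month of working hours

def reach_date(doubling_time_months: float) -> date:
    """Date at which the time horizon reaches the target, given a doubling time."""
    doublings = math.log2(target_horizon_hours / current_horizon_hours)
    months_needed = doublings * doubling_time_months
    return start_date + timedelta(days=30.4 * months_needed)

print(reach_date(4))  # faster post-2024 trend: roughly late 2027
print(reach_date(7))  # slower long-run trend: roughly mid 2029
```

Under these assumptions the faster trend lands in late 2027 and the slower one in 2029, in the same ballpark as METR’s estimates quoted above.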

(To me it seems likely that the trend will speed up even further, for reasons I will discuss in a future post.)

It may sound pessimistic, but I think we’ll have to wait until we have 1-month AIs (and maybe longer) before we can expect somewhat reliable evidence about whether alignment methods truly hold, since month-long tasks may involve moral reasoning and value tradeoffs. At that point we would also have some data on how well alignment techniques generalize from shorter to longer tasks.

Asymmetrical Time Horizons

There is one potential source of hope: perhaps AI systems will reach long time horizons in safe domains before risky ones.

Let’s say that the time horizon reaches months in, for instance, the domain of medical research before doing so in concerning domains like AI R&D (critical for AI self-improvement) or politics (critical for gaining power).

The AI could be tasked to plan experiments, allocate resources, and make tradeoffs—all while handling uncertainty and ethical considerations.

Does it prioritize maximizing revenue? Saving lives? Gaining control over its host organization?

Behavior at this stage provides meaningful (though still not conclusive) evidence about whether we have truly figured out alignment; the AI could, of course, still be faking alignment.

Unfortunately, AI R&D is too strategically valuable to participants in the AI race. Developers are unlikely to evaluate alignment in safe domains before plunging forward. In fact, AI R&D might lead other domains in time horizon growth, because it’s prioritized.

Conclusion

Summarizing key predictions:

  • Current frameworks like helpful, harmless and honest (HHH) or deliberative alignment won’t be sufficient for longer tasks.
  • Alignment for AIs that can complete longer tasks is fundamentally different, and probably more difficult, than alignment of current frontier AIs.
  • New alignment methods will be introduced that roughly correspond to the length of tasks that AIs can complete.
  • AI companies are likely to be overconfident in their alignment methods, as everything appears to be fine for weaker AIs.

Hopefully, we can test alignment in safe domains before concerning capabilities arrive in risky ones—though AI R&D is likely to be prioritized.


Thank you for reading! If you found value in this post, consider subscribing!

  1. ^

    AIs are usually able to complete tasks much faster than humans, when they become able to complete them at all. When I refer to X-long tasks (minutes-long, hours-long, etc.), I refer to tasks that take humans X time to complete on average. I’m not entirely sure how this impacts key skills and behaviors for different task lengths, as this might be different for AIs compared to humans. Again, this is a very rough sketch.

  2. ^

    I remain uncertain how different superintelligence alignment is from the alignment of less sophisticated AIs. Current methods seem heavily dependent on providing good training signals that strengthen preferable behaviors, which may not be possible for superintelligent systems operating at levels where humans can't keep up.