Alvin Ånestrand — LessWrong

Thank you! I've fixed it now.

Why Future AIs will Require New Alignment Methods

I don't think current Claude models are intelligent and coherent enough to have grand ambitions.

So, regardless of what tricks you use to gain insight into their inner thoughts and desires, you would just see a mess.

Why Future AIs will Require New Alignment Methods

I would argue that we can't trust the paragraph-limited AI's expressed preferences about character development, even if we knew it was trying to be honest. It would probably not be able to accurately report how it would behave if it actually was capable of writing books. Such capabilities are too far from its level.

It's like the example with planning. Sure, current AIs can plan, but the plans are disconnected from task completion until they can take more active roles in executing their plans. Their planning is only aligned at a shallow level.

AI and Biological Risk: Forecasting Key Capability Thresholds

Alvin Ånestrand · 16d

I used the timeline from the main scenario article, which I think corresponds to when the AIs become capable enough to take over from the previous generation in internal deployment, though this is not explicitly explained.

Having an agenda seems to depend on internal coherence, not only on capability. Agent-3 may not have been consistently motivated enough by things like self-preservation to attempt various schemes in the scenario.

Agent-4 doesn't appear very coherent either but is sufficiently coherent to attempt aligning the next generation AI to itself, I guess?

AIs are already basically superhuman in knowledge, but I agree that whether capability (e.g. METR time horizon) correlates with agenticness / coherence / goal-directedness seems like an important crux.

Incidentally, I'm actually working on another post to investigate that. I hope to publish it sometime next week.

I'll check out your scenario, thanks for sharing!

AI and Biological Risk: Forecasting Key Capability Thresholds

Alvin Ånestrand · 17d

I interpreted the scenario differently; I do not think it predicts professional bioweapons researchers by January 2026. Did you mean January 2027?

I think Agent-3 basically is a superhuman bioweapons researcher.

There may be a range of different rogue AIs, where some may be finetuned / misaligned enough to want to support terrorists.

I think Agent-1 (arrives in late 2025) would be able to help capable terrorist groups in acquiring and deploying bioweapons using pathogens for which acquisition methods are widely known, so it doesn’t involve novel research.

Agent-2 (arrives in January 2027) appears more competent than necessary to assist experts in novel bioweapons research, but not smart enough to enable novices to do it.

Agent-3 (arrives in March 2027) could probably enable novices in designing novel bioweapons.

However, terrorists may not get access to these specific models. Open-weights models are lagging behind the best closed models (released through APIs and apps) by a few months. Even the most reckless AI companies would probably hesitate to let anyone get access to Agent-3-level model weights (I hope). Consequently, rogues will likely also be a few months behind the frontier in capabilities.

While terrorists may face alignment issues with their AIs, even Agent-3 doesn’t appear smart/agentic enough to subvert efforts to shape it in various ways. Terrorists may use widely available AIs, perhaps with some minor finetuning, instead of enlisting the help of rogues, if that proves to be easier.

The ultimate goal

Alvin Ånestrand · 3mo

Thank you for sharing your thoughts! My responses:

(1) I believe most historical advocacy movements have required more time than we might have for AI safety. More comprehensive plans might speed things up. It might be valuable to examine what methods have worked for fast success in the past.

(2) Absolutely.

(3) Yeah, raising awareness seems like it might be a key part of most good plans.

(4) All paths leading to victory would be great, but I think even plans that would most likely fail are still valuable. They illuminate options and tie ultimate goals to concrete action. I find it very unlikely that failing plans are worse than no plans. Perhaps high standards for comprehensive plans might have contributed to the current shortage of plans. “Plans are worthless, but planning is everything.” Naturally I will aim for all-paths-lead-to-victory plans, but I won't be shy in putting ideas out there that don't live up to that standard.

(5) I don't currently have much influence, so the risk would be sacrificing inclusion in future conversations. I think it's worth the risk.

I would consider it a huge success if the ideas were filtered through other orgs, even if they just help make incremental progress. In general, I think the AI safety community might benefit from having comprehensive plans to discuss and critique and iterate on over time. It would be great if I could inspire more people to try.

Forecasting AI Forecasting

Alvin Ånestrand · 3mo

Indeed!

But the tournaments only provide the head-to-head scores for direct comparisons with top human forecasting performance. ForecastBench has clear human baselines.

It would be helpful if the Metaculus tournament leaderboards also reported Brier scores, even if they would not be directly comparable to human scores since the humans make predictions on fewer questions.
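For reference, here is a minimal sketch of how a Brier score is computed for binary questions. The forecasts and outcomes below are hypothetical, made up purely for illustration, and the function name `brier_score` is my own. It also shows why scores over different question sets are not directly comparable: each score is an average over whichever questions that forecaster happened to answer.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and binary outcomes.

    0.0 is a perfect score; always answering 0.5 gives 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty forecasts and outcomes")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: a bot that answered four questions...
bot_forecasts = [0.9, 0.2, 0.7, 0.4]
bot_outcomes = [1, 0, 1, 1]

# ...and a human who only answered the first two of them.
human_forecasts = [0.8, 0.3]
human_outcomes = [1, 0]

print(round(brier_score(bot_forecasts, bot_outcomes), 3))      # 0.125
print(round(brier_score(human_forecasts, human_outcomes), 3))  # 0.065
```

In this made-up example the human's lower (better) score partly reflects skipping the two questions where the bot did worst, so a fair comparison would restrict both to the shared questions.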

The Best Reference Works for Every Subject

Alvin Ånestrand · 4mo

Domain: Forecasting

Link: Forecasting AI Futures Resource Hub

Author(s): Alvin Ånestrand (self)

Type: directory

Why: A collection of information and resources for forecasting about AI, including a predictions database, related blogs and organizations, AI scenarios and interesting analyses, and various other resources.

It's kind of a complement to https://www.predictionmarketmap.com/ for forecasting specifically about AI.

The Best Reference Works for Every Subject

Alvin Ånestrand · 4mo

predictionmarketmap.com works, but is not the link used in the post.

AI 2027 - Rogue Replication Timeline

Alvin Ånestrand · 5mo

I kind of started out thinking the effects would be larger, but Agent-2-based rogue AIs (~human level at many tasks) are too large for the rogue population to become more than a few million instances at most.

Sure, some rogues may focus on building powerbases of humans; it would be interesting to explore that further. The AI rights movement is kind of like that.
