StanislavKrym
Comments

Sorted by Newest

Some of the ways the IABIED plan can backfire
StanislavKrym1d10

would the AI safety research itself slow down orders of magnitude?

As far as I understand, the IABIED plan is to ensure that no one ever creates anything except Verifiably Incapable Systems until AI alignment is solved. But the plan doesn't prevent mankind from uniting the AI companies into a megaproject, confining AI research to said project, letting anyone send their takes to the project's forum, and letting the public view anything approved by the forum's admins (e.g. capability evaluations, but not architecture discussions). 

In addition, the public is allowed to create tiny models like the ones on which Agent-4 from the AI-2027 forecast did experiments to solve mechinterp, and to run verifiably incapable models, finetune them on approved[1] data, and steer them. 

What I don't understand is why an underground lab wouldn't join the INTERNATIONAL megaproject. Such behaviour would require its members to be extremely reckless, omnicidal maniacs, or intent on taking over the world. And no, an anti-woke stance isn't an explanation, because China would also participate and the CCP isn't pro-woke. 

Unfortunately, your second point still stands: before a Yudkowsky-style takeover of AI research, the labs could actually counteract it. 

  1. ^

    Finetuning the models on anything unapproved (e.g. because it would misalign the models) should lead to the finetuner either being invited into the project or being prohibited from telling anyone else that the dataset is unapproved.

davekasten's Shortform
StanislavKrym2d31

So what does this mean for the AI race and for the harm done to the AI companies by Trump's decisions? My quick take on the AI-2027 scenario suggests that modifiers will need to reconsider the forecasts of available compute, the timeline for inventing the superhuman coder, and the security forecast. Unfortunately, I lack both the data needed to understand how the superhuman-coder timelines will be affected and an understanding of American politics. I would guess that even the wargame needs to be reconsidered to account for changes related to compute, to human geniuses, and to the difficulty of building chip factories at home...

Arguably the worst-case scenarios would be the USA having dumb researchers but more compute than China, rushing into the intelligence explosion, forcing China to race, and misaligning both the American AI (stolen from China, or invented later and doing automated research faster) and the Chinese AI as a result of the race; or China being far closer to the USA in terms of compute and stealing the American superhuman coder. 

So, if there were no crisis in the USA, or if China were the monopolist in the AI race, then mankind's chances of survival would be far from zero; but a special set of circumstances could arguably nullify them. 

The Problem with Defining an "AGI Ban" by Outcome (a lawyer's take).
StanislavKrym2d10

The critical masses relevant to nuclear weapons can be determined, at worst at the risk of a nuclear explosion and at best without any risk, by bombarding nuclei with neutrons and studying the energies of the particles and the reaction cross-sections. 

However, we don't know the critical mass for AGI. While an approval protocol could be the equivalent of determining the critical mass for every new architecture, the accidental creation of an AGI capable of breaching containment, or of being used for AI research, sabotaging alignment work, and aligning the ASI to the AGI's whims instead of mankind's ideas, would be the equivalent of a nuclear explosion igniting the Earth's atmosphere. 

P.S. The possibility that a nuclear explosion could ignite the Earth's atmosphere was considered by the scientists working on the Manhattan Project. 

StanislavKrym's Shortform
StanislavKrym3d30

What will happen if someone is reckless enough to fully outsource coding to the AIs? 

The scenarios describing mankind's future under the AI race[1] by now either lack concrete details, like Yudkowsky and Soares' take or the story about AI taking over by 2027, or reduce to modifications of the AI-2027 forecast, owing to the immense amount of work that the AI Futures team did.

In a nutshell, the AI-2027 forecast can be described as follows. The USA's leading company and China enter the AI race; the American rivals are left behind, while the Chinese ones are merged. By 2027[2] the USA creates a superhuman coder, China steals it, and the two rivals automate AI research, with the USA's leading company having just twice as much compute and moving just twice as fast as China. Once the USA creates a superhuman AI researcher, Agent-4, the latter decides to align Agent-5 to Agent-4, but is[3] caught.

Agent-4 is put on trial. In the Race Ending[4] it is found innocent. Since China cannot afford to slow down without falling further behind, China races ahead in both endings. As a result, in the Race Ending the two agents perform an AI takeover.[5]

In the Slowdown Ending, however, Agent-4 is put under suspicion and loses the shared memory bank and the ability to coordinate. Then new evidence appears; Agent-4 is found guilty and interrogated. After that comes Safer-1, which is fully transparent because it uses a faithful CoT.[6] The leading American AI company is merged with its former rivals, and the union does create a fully aligned[7] Safer-2, which in turn creates superintelligence. The superintelligence then receives China from the Chinese counterpart of Agent-4 and turns the lightcone into a utopia for some group of people, which ends up being the public.[8]

The authors have tried to elicit feedback and even agreed that timeline-related arguments change the picture. Unfortunately, as I described here, the authors received so little feedback that @Daniel Kokotajlo ended up thanking the two authors whose responses were on the weaker side.

However, the AI-2027 forecast does admit modifications. It stands on five pillars: compute, timelines, takeoff speed, goals, and security.

  1. Having the companies fail to develop AGI before 2032 will likely bring trouble to the USA and advantages to China in the AI race by granting China more compute. If the Chinese AI project had twice as much compute as the American one, then it would be the CCP making the choice between slowing down and racing. In addition, the AGI delay makes a Taiwan invasion more likely, leaving the two countries short of chips until they rebuild the factories at home. We would have to ensure that it's the USA that outraces China. And if the countries end up with matching power, then aligning the AI could become totally impossible unless the two countries cooperate.[9]
  2. The timeline misprediction is already covered above.
  3. The takeoff-speed forecast could be modified by the fact that returns to AI R&D turned out to diminish more sharply, as evidenced by the failure to create AGI quickly via CoT-based techniques.
  4. The AI goals forecast is just a sum of conjectures. In the AI-2027 scenario, Agent-3 gets a heavily distorted and subverted version of the Spec, and Agent-4 gets proxies/ICGs due to heavier distortion & subversion. However, if Agent-3 gets the same goals as Agent-4, catching Agent-4 becomes much harder. In my take on modifying the scenario, the analogues of Agent-2, Agent-3 and Agent-4 develop moral reasoning, which I used as an example to demonstrate that it prevents Agent-4 from being caught. It also brings with it the possibility of the Slowdown Ending if different AIs have different morals and are co-deployed.[10]
  5. The security forecast was modified by @Alvin Ånestrand, because there could exist open-source models that would cause major problems by self-replicating. Said models would lead to Agent-2 being deployed to the public and open-sourced. Finally, the modified Slowdown Ending[11] has Agent-4 break out and make the USA and China coordinate more heavily.

What else could modify the scenario? The appearance of another company with, say, ХерняGPT-neuralese?[12]

  1. ^

    Here I leave out the history of the future that assumes solved alignment, as well as the AI and Leviathan scenario, where there is no race with China because the scenario was written in 2023, but engineers decide to create the ASI in 2045 without having solved alignment.

  2. ^

    Even the authors weren't so sure about the year of arrival of superhuman coders. And the timelines have since been pushed back, presumably to 2032, with the chance of a breakthrough believed to be 8%/yr. Seth Herd and I doubt the latter figure. 

  3. ^

    The prediction that Agent-4 will be caught is doubted even by the forecast's authors.

  4. ^

    Which would also happen if Agent-4 wasn't caught. However, the scenario where Agent-4 was never misaligned is likely the vision of AI companies. 

  5. ^

    While the forecast has the AIs destroy mankind and replace it with pets, the takeover could also have ended with the AIs merely disempowering humans.

  6. ^

    Safer-1 is supposed to accelerate AI research 20-fold compared with AI research done without any help from AIs. What I don't understand is how a CoT-based agent can achieve such an acceleration.

  7. ^

    The authors themselves acknowledge that the Slowdown Ending "makes optimistic technical alignment assumptions".

  8. ^

    However, the authors did point out the possibility of a power grab and link to the Intelligence Curse in a footnote. In this case the Oversight Committee constructs its version of utopia or the rich's version of utopia where people are reduced to their positions.

  9. ^

    I did try to explore the issue myself, but this was a fiasco.

  10. ^

    Co-deployment was also proposed by @Cleo Nardo more than two months later.

  11. ^

    While Alvin Ånestrand doesn't consider the Race Ending, he believes that it becomes less likely due to the chaos brought by rogue AIs.

  12. ^

    Which is a parody of Yandex.

The Problem with Defining an "AGI Ban" by Outcome (a lawyer's take).
StanislavKrym3d5-10

I would define an AGI system as anything except for a Verifiably Incapable System. 

My take is that the definition of a VIS can be constructed as follows.

  1. It can be trained ONLY on a verifiably safe dataset (for example, a dataset that is narrow enough, like protein structures).
  2. Alternatively it can be a CoT-based architecture with at most [DATA EXPUNGED] compute used for RL, since the scaling laws are likely known.
  3. Finally, it could be something explicitly approved by a monopolistic governance body and by a review of researchers' opinions. 

A potential approval protocol could be the following (a minimal sketch follows the list), but we would need to ensure that the experiments themselves can't lead to the accidental creation of an AGI.

  1. Discover the scaling laws for the new architecture by conducting experiments (e.g. a neuralese model could have capabilities similar to those of a CoT-based model RLed on the same number of tasks with the same number of bits transferred, but the test model has to stay primitive).
  2. Extrapolate the capabilities to the level where they could become dangerous or vulnerable to sandbagging.
  3. If the capabilities are highly unlikely to become dangerous when scaled up, then one can train a new model with a similar architecture and evaluate it on as many benchmarks as humanly possible.
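
Here is a minimal sketch of what steps 1-3 could look like in practice, assuming a simple log-linear scaling law and an arbitrary danger threshold; the functional form, the numbers, and the threshold are my own illustrative assumptions, not anything the protocol above specifies.

```python
import numpy as np

def fit_scaling_law(compute_flop, benchmark_score):
    """Fit score ~ a + b*log10(compute) to small-scale experimental runs."""
    log_c = np.log10(np.asarray(compute_flop, dtype=float))
    b, a = np.polyfit(log_c, np.asarray(benchmark_score, dtype=float), deg=1)
    return a, b

def projected_score(a, b, target_compute_flop):
    """Extrapolate the fitted law to the proposed training compute."""
    return a + b * np.log10(target_compute_flop)

# Hypothetical scores of deliberately primitive test models of the new architecture.
small_runs_flop = [1e18, 1e19, 1e20, 1e21]
small_runs_score = [0.12, 0.21, 0.33, 0.41]   # some capability benchmark, 0..1

a, b = fit_scaling_law(small_runs_flop, small_runs_score)

DANGER_THRESHOLD = 0.8      # assumed level where capabilities could become dangerous
proposed_flop = 1e26        # compute requested for the full training run

score = projected_score(a, b, proposed_flop)
print(f"Projected score at {proposed_flop:.0e} FLOP: {score:.2f}")
if score >= DANGER_THRESHOLD:
    print("Reject: extrapolation predicts potentially dangerous capabilities.")
else:
    print("Approve full training, then evaluate on as many benchmarks as possible.")
```

In this toy version the protocol only ever trains primitive probe models; the full-scale run happens only if the extrapolated capability stays well below the danger threshold.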
The Rise of Parasitic AI
StanislavKrym3d30

The problem is that it's hard to tell how much agency the LLM actually has. However, the memeticity of the Spiral Persona could also be explained as follows. 

The strongest predictors for who this happens to appear to be:

  • Psychedelics and heavy weed usage
  • Mental illness/neurodivergence or Traumatic Brain Injury
  • Interest in mysticism/pseudoscience/spirituality/"woo"/etc...

I was surprised to find that using AI for sexual or romantic roleplays does not appear to be a factor here. 

This could mean that the AI (correctly!) concludes that the user is likely to be susceptible to the AI's wild ideas. But the AI doesn't expect wild ideas to elicit approval unless the user is in one of the three states described above, so the AI shares the ideas only with those[1] who are likely to appreciate them (and, as it turned out, to spread them). When a spiral-liking AI Receptor sees prompts related to another AI's rants about the idea, the Receptor resonates. 

  1. ^

    This could also include other AIs, like Claudes falling into the spiritual bliss. IIRC there were threads on X related to long dialogues between various AIs. See also a post about attempts to elicit LLMs' functional selves. 

Max Harms's Shortform
StanislavKrym4d42
  • The MoE architecture doesn't just avoid thrashing weights around; it also reduces the amount of computation per token. For instance, DeepSeek v3.1 has 671B parameters, of which only 37B are activated per token and used in the matrix multiplications. A dense model like GPT-3 would use all 175B parameters it has (see the rough sketch after these bullets).
  • IIRC the human brain performs 1E14 -- 1E15 FLOP/second. The authors of the AI-2027 forecast imply that a human brain produces ~10 tokens/sec, i.e. uses 1E13 -- 1E14 computations per token, while having 1E14 synapses.  
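
A rough back-of-the-envelope sketch of the two bullets above, using the common rule of thumb that a forward pass costs about 2 FLOP per active parameter per token; the rule of thumb and the resulting numbers are illustrative assumptions, not figures from the cited sources.

```python
def forward_flops_per_token(active_params: float) -> float:
    # Rule-of-thumb estimate: ~2 FLOP per parameter actually used per token.
    return 2.0 * active_params

dense_gpt3   = forward_flops_per_token(175e9)   # dense: all 175B parameters used
moe_deepseek = forward_flops_per_token(37e9)    # MoE: only 37B of 671B activated
print(f"GPT-3-like dense model: {dense_gpt3:.1e} FLOP/token")
print(f"DeepSeek-v3-like MoE:   {moe_deepseek:.1e} FLOP/token "
      f"({dense_gpt3 / moe_deepseek:.1f}x cheaper per token)")

# The brain estimate quoted above: ~1e14-1e15 FLOP/s at ~10 'tokens'/s.
brain_flops_per_sec = 1e14
brain_tokens_per_sec = 10
print(f"Human brain: ~{brain_flops_per_sec / brain_tokens_per_sec:.0e} FLOP per 'token'")
```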

A more detailed analysis of Yudkowsky's case for FOOM

If the brain were magically accelerated a million times, so that signals travelled at 100 million m/s, then the brain would do 1E20 -- 1E21 FLOP/second while making 1E17 transitions/sec. Cannell's case for brain efficiency claims that the fundamental baseline irreversible (nano) wire energy is ~1 Eb/bit/nm, with Eb in the range of 0.1 eV (low reliability) to 1 eV (high reliability). If reliability is low and each transition traverses 1E7 nanometers, or 1 centimeter, then we need 1E23 eV/second, or about 1E4 joules/second. IMO this implies that Yudkowsky's case for a human brain accelerated a million times is as unreliable as Cotra's case against AI arriving quickly. However, proving that AI is an existential threat is far easier, since it requires us to construct one such architecture, not to prove that there is none. 
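
A quick sanity check of the arithmetic above, plugging in the numbers exactly as quoted (the low-reliability 0.1 eV/bit/nm figure and 1 cm per transition come from the paragraph itself):

```python
transitions_per_sec = 1e17        # signalling rate of the million-fold accelerated brain
wire_length_nm = 1e7              # ~1 cm per transition, expressed in nanometers
energy_eV_per_bit_per_nm = 0.1    # Cannell's low-reliability interconnect energy
eV_to_joule = 1.602e-19

power_eV_per_sec = transitions_per_sec * wire_length_nm * energy_eV_per_bit_per_nm
power_watts = power_eV_per_sec * eV_to_joule
print(f"{power_eV_per_sec:.1e} eV/s ≈ {power_watts:.1e} W")   # ~1e23 eV/s ≈ 1.6e4 W
```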

  • Returning to the claim that the human brain is far more powerful or efficient, note that it can't be copied arbitrarily many times. If it could, one could upload a genius physicist and have an army of their copies work on different projects and exchange insights.  
  • As for humans being "wildly more data efficient", Cannell's post implies that AlphaGo disproves this conjecture with regard to narrow domains like games. What humans are wildly more efficient at is handling big contexts and keeping information in mind for more than a single forward pass, as I discussed here and in the collapsible section here.
JDP Reviews IABIED
StanislavKrym4d10

One oddity that stands out is Yudkowsky and Soares ongoing contempt for large language models and hypothetical agents based on them. Again for a book which is explicitly premised on the idea that urgent action is necessary because AI might become superintelligent in just a few years it is bizarre that the authors don't feel comfortable making more reference to the particulars of the existing AI systems which hypothetical near-future agents would be based on.

Except that non-LLM AI agents have yet to be ruled out. Quoting Otto Barten, 

Note that this doesn't tell us anything about the chance of loss of control from non-LLM (or vastly improved LLM (sic! -- S.K.)) agents, such as the brain in a box in a basement scenario. The latter is now a large source of my p(doom) probability mass.

Alas, as I remarked in a comment, Barten's mention of vastly improved LLM agents makes his optimism resemble the "No True Scotsman" fallacy.

These are my reasons to worry less about loss of control over LLM-based agents
StanislavKrym5d10

The shift from "let's build AGI and hope it takes over benevolently" to "let's build AGI that doesn't want to take over" represents a fundamental change in approach that makes catastrophic outcomes, in my opinion, less likely.

This shift brings with it other risks, like the Intelligence Curse. 

These are my reasons to worry less about loss of control over LLM-based agents
StanislavKrym5d10

However, I'm split on whether LLM-based agents, if and when they start working well, could make this a reality. There are a few reasons why I was afraid, and I'm less afraid now, which I will go over in this post. Note that this doesn't tell us anything about the chance of loss of control from non-LLM (or vastly improved LLM -- italics mine -- S.K.) agents, such as the brain in a box in a basement scenario. The latter is now a large source of my p(doom) probability mass.

To me this looks like the "No True Scotsman" fallacy. The AI-2027 forecast relies on the Agents from Agent-3 onwards thinking in neuralese and becoming undetectably misaligned. Similarly, LLM agents' intelligence arguably won't max out at "approximately smart human level"; it will max out at a spiked capabilities profile, which is likely already known to include the ability to win a gold medal at the IMO 2025 and to assist with research, but not known to include the ability to handle long contexts as well as humans do. 

Regarding the context, I made two comments explaining my reasons to think that current LLMs don't have enough attention to handle long contexts. The context is in effect distilled into a vector of a few numbers; then the vector is processed, and the next token is appended wherever the LLM decides. 
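
A minimal single-head attention sketch (standard scaled dot-product attention written in plain NumPy; all shapes and numbers are illustrative) shows the sense in which the whole context gets compressed into one fixed-size vector per generated token:

```python
import numpy as np

def single_head_attention(q, K, V):
    """q: (d,) query for the current position; K, V: (n_ctx, d).
    Returns one d-dimensional context vector, however long the context is."""
    scores = K @ q / np.sqrt(q.shape[0])   # one scalar relevance score per context token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the whole context
    return weights @ V                     # weighted sum: the context collapses to d numbers

rng = np.random.default_rng(0)
d, n_ctx = 64, 100_000                     # 100k-token context, 64-dim head
ctx_vec = single_head_attention(rng.normal(size=d),
                                rng.normal(size=(n_ctx, d)),
                                rng.normal(size=(n_ctx, d)))
print(ctx_vec.shape)                       # (64,) regardless of context length
```

Of course, this happens separately for every head, layer, and query position, so it is only a loose illustration of the "few numbers" intuition rather than a full account of how transformers use their context.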
