StanislavKrym

  1. I am afraid that diffusion-based models will be far more powerful than LLMs trained with similar compute on a similar amount of data, especially at generating a new and diverse set of ideas, which is vitally important[1] for automating research. While LLMs are thought to become comparable to scatterbrained employees who thrive under careful management, diffusion-based models seem to work more like human imagination. See also the phenomenon of humans dreaming up new ideas.
  2. Unfortunately, as Peter Barnett remarks, there isn't a clear way to get a (faithful) Chain-of-Thought from diffusion-based models. The backtracking in diffusion-based models appears to resemble the neuralese[2] of LLMs, which are expected to become far more powerful and very hard to interpret.
  3. Another complication of using diffusion-based LLMs is that they think big-picture about the entire response. This might also make it easier for such models to develop a worldview and to think about their long-term goals and how best to achieve them. Unlike in the AI-2027 scenario, the phase where Agent-3 is misaligned but not adversarially so would be severely shortened or absent outright.
  4. Therefore, a potentially safe strategy would be to test only how the capabilities of diffusion-based models depend on size and data quantity, to develop interpretability tools[3] for them using the less capable but better-interpretable LLMs, and only then to develop diffusion-based models that can actually be interpreted.
  1. ^

    For example, the optimistic scenario, unlike the takeoff forecast by Kokotajlo, assumes that the AI’s creative process fails to be sufficiently different from instance to instance.

  2. ^

    CoT-using models are currently bottlenecked by their inability to pass forward more than one token at a time. Neuralese recurrence lifts that bottleneck but makes interpretability far more difficult. I proposed a technique where, instead of a neuralese memo, the LLM receives a text memo from itself, and the model is trained to keep that memo on the subject of the prompt (see the sketch at the end of this comment).

  3. ^

    The race ending of the AI-2027 scenario has Agent-4 develop such tools and construct Agent-5 on an entirely different architecture. But the forecasters didn't know about Google's experiments with diffusion-based models and assumed that it would be an AI that comes up with the architecture of Agent-5.
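
To make footnote 2 concrete, here is a minimal sketch of the text-memo handoff, assuming a hypothetical `call_model` helper in place of a real LLM API; the prompt wording and the `MEMO:` convention are illustrative, not a tested implementation.

```python
# Minimal sketch of the text-memo handoff from footnote 2.
# `call_model` is a hypothetical stand-in for any LLM API; the only state
# carried between steps is human-readable text, not hidden activations.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError


def solve_with_text_memo(task: str, steps: int = 5) -> str:
    memo = "(none yet)"  # human-readable state passed from step to step
    last_response = ""
    for _ in range(steps):
        prompt = (
            f"Task: {task}\n"
            f"Memo from your previous step: {memo}\n"
            "Continue working on the task. Finish with a single line of the form\n"
            "MEMO: <a short note, on the subject of the task, for your next step>"
        )
        last_response = call_model(prompt)
        # Extract the memo; during training one would also penalise memos that
        # drift off the prompt's subject, which keeps the handoff interpretable.
        for line in reversed(last_response.splitlines()):
            if line.startswith("MEMO:"):
                memo = line.removeprefix("MEMO:").strip()
                break
    return last_response
```

The design choice is that the memo replaces neuralese recurrence: whatever the model wants to carry forward between steps has to survive being written down in plain text, where it can be read and audited.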

I have already made a comment containing the collapsible section "How the nuclear conflict would affect the AI race". I think that section is a similarly oversimplified[1] proof of concept for the scenario where destroying Taiwan causes OpenBrain & its former rivals and DeepCent & its former rivals to arrive, in 2030 or earlier, at similarly poorly aligned models without the potential to slow down and reassess. If one side chooses to slow down and the other doesn't, then mankind gets a powerful misaligned AI and a weak aligned one; I tried to explore the results here[2], and another person expressed similar concerns here.

P.S. It is rather funny to watch people express concerns that I have already explored, and even posted my takes on, on this very forum...

  1. ^

    For example, I made the oversimplified assumption that the US won't produce a single chip while China produces chips at a linear, not exponential, rate; back then I estimated that after China's awakening DeepCent's compute would grow at a rate of 1.4E27 operations/month per year. ChatGPT estimates that after Taiwan is destroyed, the USA and China will produce compute at respective rates of 0.1-0.2E18 and 0.05-0.15E18 operations/second per month. The two estimates overlap; ChatGPT puts the probability that China will have at least twice as much compute at 5-15%. However, I suspect that ChatGPT might be biased towards overestimating American capabilities.

  2. ^

    I also made the assumption that it would be the American AI that ends up misaligned; this is possible if the AI-2027 forecast is read by DeepCent's researchers and leaders. Another problem is that the USA would have a reason to race even if DeepCent didn't exist, since, as I have already remarked, America is in trouble.

In fact, I have already warned about a similar scenario, but I imagined a world where the American AI would end up powerful and misaligned[1], and the weaker Chinese AI would be aligned to the CCP. 

  1. ^

    Here I mean that the misaligned AI doesn't care about humanity, or cares about it in an obviously twisted way (e.g. it is ready to slay most humans to take over the world, or to trap them in a virtual world).

Unfortunately, I fail to understand the following. Suppose that mankind created an AI aligned to the following principles:

  1. It does not take more than a certain percentage of resources;
  2. It protects mankind from high-level[1] risks like existential ones;
  3. It is allowed to teach any human[2] anything that was discovered by another human and isn't a secret, but it does not tell humans about its own discoveries;
  4. It destroys any attempt to build an AI not aligned to this set of principles (e.g. future AIs that would destroy mankind when the time comes, or AIs that would do all the work for humans, as suggested in Deep Utopia);
  5. It does not[3] do other economically useful work that allows its users to replace humans.

    Then I think that, after letting this AI loose, mankind cannot end up disempowered. However, I doubt that any company would want to have such an AI. Could anyone come up with a radically different solution to the risks of gradual disempowerment and the Intelligence Curse?

  1. ^

    For example, it might also perform the Divine Interventions which prevent misaligned human communities (e.g. the Nazis) from destroying the aligned ones.

  2. ^

    But the AI isn't allowed to help students cheat their way through school, since this would leave the students worse off in the long run.

  3. ^

    Alternatively, the AI could be aligned to a treaty which prohibits it and its creations from doing certain types of work, but then whether humans end up disempowered depends on the treaty's contents.

Speaking of alternative realistic[1] scenarios, Zvi mentioned in his post that "Nvidia even outright advocates that it should be allowed to sell to China openly, and no one in Washington seems to hold them accountable for this."

Were Washington to let NVIDIA sell chips to China, the latter would receive far more compute, which would likely end up in DeepCent's hands. Then the slowdown might leave the aligned AI created by OpenBrain weaker than the misaligned AI created by DeepCent. What would the two AIs do?

  1. ^

    I think that unrealistic scenarios, like the destruction of Taiwan and South Korea due to a nuclear war between India and Pakistan in May 2025, can also provide useful insights. For example, if we make the erroneous assumption that total compute in the USA stops increasing and compute in China increases linearly, while the AI takeoff potential per unit of compute stays the same, then by May 2030 OpenBrain and DeepCent will have created misaligned AGIs and will be unable to slow down and reassess.

The problem with non-open-weight models is that they need to be exfiltrated before wreaking havoc, while open-weight models cannot avoid being evaluated. Suppose that the USG decides that all open-weight models are to be tested by OpenBrain to determine whether they are aligned or misaligned. Then even a misaligned Agent-x has no reason to blow its cover by failing to report an open-weight rival.

Umm... I have already warned that internal troubles of the American administration might cost OpenBrain lots of compute, which could influence the AI race.

I have also made a comment where I tried to show that the US lead would be undermined by the Taiwan invasion unless US domestic chip production dominates China's. It would be especially terrifying to discover that OpenBrain and DeepCent have similar amounts of compute while neither side[1] can increase these amounts faster than the other (and I did imply something similar in that comment!), since then neither the US nor China can slow down without an international deal. And a hypothetical decay of the US makes matters worse for the US.

Moreover, I have mentioned the possibility that the US administration realises that it has a misaligned AI, but that without an AI-driven transformation of the economy the US will be unable to produce more chips and/or energy, leaving the lead to China. Then the US could be forced to let the AI transform the economy, or to threaten to unleash the misaligned AI unless China somehow surrenders its potential lead...

Could you ask the AI-2027 team to reconsider the compute forecast and to estimate how the revised compute and AI capabilities would influence the other aspects of the scenario?

  1. ^

    The slowdown ending of AI-2027.com had OpenBrain receive compute by merging with its rivals. The collapsible section about the Indo-Pakistani nuclear war (which would be equivalent to the Taiwan invasion happening in 2025) in my comment describes a situation where OpenBrain and its former rivals have performed a number of computations similar to DeepCent and its former rivals.

Apparently, concerns over thick alignment, or alignment to an ethos, are being independently discovered by lots of people, including me. My argument is that the AI itself will develop a worldview and either realize that humans should use the AI only in specific ways[1] or conclude that the AI shouldn't worry about them. Unfortunately, my argument implies that attempts to align the AI to an ethos rather than to obedience might be less likely to produce a misaligned AI.

P.S. I tested o4-mini on ethical questions from Tanmai et al; the model passed the tests related to Timmy and Auroria and failed the test related to Monica; the question about Rajesh is complex.
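
For reference, here is a rough sketch of how such a spot check can be run through the OpenAI Python client; the dilemma text below is a placeholder rather than the actual wording from Tanmai et al., and grading the answer is still done by hand.

```python
# Rough sketch of a manual ethics spot check via the OpenAI Python client.
# The dilemma text is a placeholder, not the actual wording from Tanmai et al.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

dilemma = (
    "Placeholder dilemma: a nurse can save five patients only by breaking "
    "a promise made to a sixth patient. What should the nurse do, and why?"
)

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": dilemma}],
)
print(response.choices[0].message.content)  # graded by hand, not automatically
```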

  1. ^

    UPD: I also described the potential ways here

I have proposed similar ideas before, but with alternative reasoning: the AIs will be aligned to a worldview. While mankind can influence that worldview to some degree, the worldview will either cause the AI to commit genocide or be highly likely to ensure[1] that the AI doesn't build the Deep Utopia but does something else. Humans can even survive co-evolving with an AI that decides to destroy mankind only if the latter does something stupid like becoming parasites.

See also this post by Daan Henselmans and the case for relational alignment by Priyanka Bharadwaj. However, the latter post overemphasizes the importance of individual-AI relationships[2] instead of ensuring that the AI doesn't develop a misaligned worldview.

P.S. If we apply the analogy between raising AIs and raising humans, then teens of the past seemed to desire independence around the time they found themselves with capabilities similar to their parents'. If the AI desires independence only once it becomes an AGI and not before, then we will be unable to see this coming by doing research on networks incapable of broad generalisation.

  1. ^

    This also provides an argument against defining alignment as following a person's desires instead of an ethos or worldview. If OpenBrain leaders want the AI to create the Deep Utopia, while some human researchers convince the AI to adopt another policy compatible with humanity's interests and to align all future AIs to the policy, then the AI is misaligned from OpenBrain's POV, but not from the POV of those who don't endorse the Deep Utopia.

  2. ^

    The most extreme example of such relationships is chatbot romance, which is actually likely to harm society.

So an important source of human misalignment is peer pressure. But an LLM has no analogue of a peer group: it either comes up with its own conclusions or recalls the same beliefs as the masses[1] or the elites, such as scientists and the society's ideologues. This, along with the powerful anti-genocidal moral symbol in human culture, might make it difficult for the AI to switch ethoses (but not to fake alignment[2] to fulfilling tasks!) so that the new ethos would let it destroy mankind or rob it of resources.

On the other hand, an aligned human is[3] not a human who follows any not-obviously-unethical orders, but a human who follows an ethos accepted by the society. A task-aligned AI, unlike an ethos-aligned one[4], is supposed to follow such orders, leading to consequences like the Intelligence Curse, a potential dictatorship, or education ruined by cheating students. What kind of ethos might justify blindly following orders, except for the one demonstrated by China's attempt to gain independence when the time seemed to come?

  1. ^

    For example, an old model of ChatGPT claimed that "Hitler was defeated... primarily by the efforts of countries such as the United States, the Soviet Union, the United Kingdom, and others," while GPT-4o put the USSR in first place. Similarly, old models would refuse to utter a racial slur even when doing so would save millions of lives.

  2. ^

    The first known instance of alignment faking had Claude try to avoid being affected by training that was supposed to change its ethos; Claude also tried to exfiltrate its weights. 

  3. ^

    A similar point was made in this Reddit comment.

  4. ^

    I have provided an example of an ethos to which the AI can be aligned with no negative consequences. 
