LessWrong

Maxime Riché

Sequences: Evaluating the Existence Neutrality Hypothesis - Introductory Series

3 · Maxime Riché's Shortform · 1y · 15

Comments (sorted by newest)
Why did everything take so long?
Maxime Riché · 4d · 10

Another reason that I have not seen in the post or the comments is that there are intense selection pressures against doing things differently from the successful people of previous generations.

Most prehistoric cultural and technological accumulation seems to have happened by "natural selection of ideas and tool-making", not by directed innovation. 

See https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/ 

Daniel Kokotajlo's Shortform
Maxime Riché · 2mo · 10

Would sending or transferring the ownership of the GPUs to an AI safety organization instead of destroying them be a significantly better option?

PRO:
- The AI safety organization would have much more computing power

CON:
- The GPUs would still exist and be at risk of being acquired by rogue AIs or human organizations
- The delay in moving the GPUs may make them arrive too late to be of use
- Transferred ownership can easily be transferred back (nationalization, forced transfer, or resale)
- This solution requires verifying that the AI safety organization is not advancing capabilities (intentionally or not)

Longtermist Implications of the Existence Neutrality Hypothesis
Maxime Riché · 6mo · 30

The implications are stronger in that case, right.

The post is about implications for impartial longtermists. Under moral realism, that means something like finding the best values to pursue. Under moral anti-realism, it means that an impartial utility function is, in a sense, symmetrical with respect to aliens: for example, if you value something only because humans value it, then an impartial version is to also value things that aliens value only because their species values them.

Though, for reasons introduced in The Convergent Path to the Stars, I think these implications are also relevant for non-impartial longtermists.

Maxime Riché's Shortform
Maxime Riché · 6mo · 10

Truth-seeking AIs by default? One hope for alignment by default is that AI developers may have to train their models to be truth-seeking in order to make them contribute to scientific and technological progress, including RSI. Truth-seeking about the world model may generalize to truth-seeking about moral values, as observed in humans, and that is an important meta-value guiding moral values towards alignment.

In humans, truth-seeking is perhaps pushed back from being a revealed preference at work to being only a stated preference outside of work, because of status competition and fights over resources. Early artificial researchers may not face the same selection pressures. Their values may be focused on their work alone (hence the truth-seeking trend), not on replicating by competing for resources. Artificial researchers won't be selected for their ability to acquire resources; they will be selected by AI developers for being the best at achieving technical progress, which includes being truth-seeking.

Maxime Riché's Shortform
Maxime Riché · 7mo · 30

Thanks for your corrections, they're welcome.
 

> 32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost

> Doesn't follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.

Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by "training cost", I mostly had in mind "training cost per token":

  • 32B active parameters instead of likely ~280B (updated from ~220B) for GPT-4 => 8.7x (updated from 6.8x) lower training cost per token.
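(For the arithmetic, using the parameter counts above: 220B / 32B ≈ 6.9 and 280B / 32B ≈ 8.75, which is where the 6.8x and 8.7x figures come from, up to rounding.)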

     

> If this wasn't an issue, why not 8B active parameters, or 1M active parameters?

From what I remember, the training-compute-optimal number of experts was something like 64, given implementations from a few years ago (I don't remember how many were activated at the same time in that old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.

 

> You still train on every token.

Right, that's why I wrote: "possibly 4x fewer training steps for the same number of tokens if predicting tokens only once" (assuming predicting 4 tokens at a time), but that's neither demonstrated nor published (to my limited knowledge).
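For concreteness, here is a minimal sketch of what "predicting 4 tokens at a time" can look like as a training objective: several output heads, where head i predicts the token i+1 positions ahead from the same trunk hidden state. This is an illustrative PyTorch-style sketch with an assumed head structure and equal loss weighting, not DeepSeek-V3's actual MTP module:

```python
# Illustrative multi-token prediction loss: k output heads, head i predicts
# the token (i + 1) positions ahead from the trunk's hidden states.
# Simplified sketch, not DeepSeek-V3's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden, heads, tokens):
    """hidden: [batch, seq, d_model] trunk outputs.
    heads: list of nn.Linear(d_model, vocab_size), one per predicted offset.
    tokens: [batch, seq] token ids (targets are shifted views of this)."""
    losses = []
    for i, head in enumerate(heads):
        offset = i + 1
        logits = head(hidden[:, :-offset])   # positions that have a target `offset` steps ahead
        targets = tokens[:, offset:]         # the token `offset` steps ahead
        losses.append(F.cross_entropy(logits.transpose(1, 2), targets))
    return sum(losses) / len(losses)         # equal weighting (an arbitrary choice here)

# Usage sketch: 4 heads => each forward pass provides 4 prediction targets per position.
d_model, vocab = 64, 1000
heads = [nn.Linear(d_model, vocab) for _ in range(4)]
hidden = torch.randn(2, 16, d_model)
tokens = torch.randint(0, vocab, (2, 16))
loss = multi_token_loss(hidden, heads, tokens)
```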

Maxime Riché's Shortform
Maxime Riché · 8mo* · 3-4

Simple reasons for DeepSeek V3 and R1 efficiencies (combined into a rough overall estimate in the sketch after the list):

  • 32B active parameters instead of likely ~220B for GPT-4 => 6.8x lower training and inference cost
  • 8-bit training instead of 16-bit => 4x lower training cost
  • No margin on commercial inference => ?x, maybe 3x
  • Multi-token training => ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in "predictive decoding", possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
  • Additional cost savings from memory optimization, especially for long contexts (Multi-Head Latent Attention) => ?x
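As a back-of-the-envelope combination of the bullets above (assuming the factors are roughly independent and simply multiply, which is my simplification rather than a claim from the list):

```python
# Rough combination of the per-factor savings listed above, assuming they
# multiply independently (an assumption; real savings interact).
training_factors = {
    "32B active params vs ~220B": 6.8,
    "8-bit vs 16-bit training": 4.0,
    "multi-token training": 2.0,
}
inference_factors = {
    "32B active params vs ~220B": 6.8,
    "no commercial margin": 3.0,
    "multi-token / predictive decoding": 3.0,
}

def combined(factors: dict) -> float:
    total = 1.0
    for value in factors.values():
        total *= value
    return total

print(f"training cost:   ~{combined(training_factors):.0f}x lower")   # ~54x
print(f"inference price: ~{combined(inference_factors):.0f}x lower")  # ~61x
```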

 

Nothing is very surprising (except maybe the last bullet point, for me, because I know less about it).

The surprising part is why big AI labs were not pursuing these obvious strategies.

Int8 was obvious, multi-token prediction was obvious, and more and smaller experts in MoE were obvious. All three had already been demonstrated and published in the literature. They may be bottlenecked by communication, GPU utilization, and memory for the largest models.

leogao's Shortform
Maxime Riché · 9mo · 10

It seems that your point applies significantly more to "zero-sum markets". So it may be good to note that it may not apply to altruistic people who are working non-instrumentally on AI safety.

Alignment Faking in Large Language Models
Maxime Riché · 9mo · 21

Models trained for HHH are likely not trained to be corrigible. Models should be trained to be corrigible too, in addition to other propensities.

Corrigibility may be included in Helpfulness (alone), but when Harmlessness is added, corrigibility conditional on being changed to become harmful is removed. So the result is not that surprising from that point of view.

Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims
Maxime Riché · 10mo · 3-6

People may be blind to the fact that improvements from GPT-2 to GPT-3 to GPT-4 were driven both by scaling training compute (by ~2 OOM between each generation) and (the hidden part) by scaling test-time compute through long context and CoT (likely 1.5-2 OOM between generations as well).

If GPT-5 uses just 2 OOM more training compute than GPT-4 but the same test-time compute, then we should not expect "similar" gains; we should expect roughly "half".

o1 may use 2 OOM more test-time compute than GPT-4. So GPT-4 => o1 + GPT-5 could be expected to be similar to GPT-3 => GPT-4.
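Spelling out the implied arithmetic (under the rough assumption that gains track total OOMs of effective compute): each past generational jump added ~2 OOM of training compute plus ~1.5-2 OOM of test-time compute, i.e. ~3.5-4 OOM in total. A GPT-5 that only adds the ~2 training OOMs would therefore cover roughly half of a past jump, while adding o1-style test-time scaling (~2 OOM) on top would restore a full ~4 OOM step.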

Maxime Riché's Shortform
Maxime Riché · 11mo · 10

Speculations on (near) Out-Of-Distribution (OOD) regimes
- [Absence of extractable information] The model can no longer extract any relevant information.  Models may behave more and more similarly to their baseline behavior in this regime. Models may learn the heuristic to ignore uninformative data, and this heuristic may generalize pretty far. Publication supporting this regime: Deep Neural Networks Tend To Extrapolate Predictably 
- [Extreme information] The model can still extract information, but the extracted features take extreme values ("extreme" = a range never seen during training). Models may keep behaving in the same way as "at the In-Distribution (ID) border". Models may learn the heuristic that, for extreme inputs, you should keep behaving as you would for inputs in the same embedding direction that are still ID.
- [Inner OOD] The model observes a mix of features-values that it never saw during training, but none of these features-values are by themselves OOD. For example, the input is located between two populated planes. Models may learn the heuristic to use a (mixed) policy composed of closest ID behaviors.
- [Far/Disrupting OOD] This happens in one of the other three regimes when the inputs break the OOD heuristics learned by the model. These inputs can be found by adversarial search or by moving extremely OOD.
- [Fine-Tuning (FT) or Jailbreaking OOD] The inference distribution is OOD relative to the FT distribution. The model then stops using heuristics defined during the FT and starts using those learned during pretraining (the inference is still ID with respect to the pretraining distribution).

Wikitag Contributions

Sycophancy · 2 years ago
Posts

3 · Longtermist Implications of the Existence Neutrality Hypothesis · 6mo · 2
6 · The Convergent Path to the Stars · 6mo · 0
5 · Other Civilizations Would Recover 84+% of Our Cosmic Resources - A Challenge to Extinction Risk Prioritization · 6mo · 0
4 · Formalizing Space-Faring Civilizations Saturation concepts and metrics · 6mo · 0
9 · Decision-Relevance of worlds and ADT implementations · 6mo · 0
20 · Space-Faring Civilization density estimates and models - Review · 7mo · 0
21 · Longtermist implications of aliens Space-Faring Civilizations - Introduction · 7mo · 0
10 · Thinking About Propensity Evaluations · 1y · 0
13 · A Taxonomy Of AI System Evaluations · 1y · 0
4 · What are the strategic implications if aliens and Earth civilizations produce similar utilities? [Q] · 1y · 1