
Adam Newgas

https://www.boristhebrave.com/

Comments

Adam Newgas's Shortform
Adam Newgas · 23d

Good point. Perhaps it would be better to say they'll stop focussing on IMOs and coding tasks so much?

Adam Newgas's Shortform
Adam Newgas · 25d

You assume "no sycophancy" was the right option.

It might be that the race for AGI gets replaced with the race for market dominance, and major companies stop optimising in the direction of more intelligence. Unlikely, I think, but it could potentially be good in the Pause AI sense.

Adam Newgas's Shortform
Adam Newgas · 25d

Regarding https://x.com/AISafetyMemes/status/1954481633194614831

I think a lot of OpenAI's problem was that they botched the launch and users essentially got reduced limits and stupider models. But the basic framing of the tweet is correct: OpenAI reduced sycophancy, and got a ton of complaints encouraging them to reinstate the model.

OpenAI can learn one of two lessons from this:

  • Sycophancy is terrifying and they should take pains to avoid it; or
  • A great deal of a model's popularity depends on sycophancy rather than quality

Let's hope they pick the right one.

Parv Mahajan's Shortform
Adam Newgas · 26d

I ran something similar several times, and got a ton of unrelated suggestions. Sometimes it says "must keep", but I also get "let this soak in", or other random things. It's just guessing.

I expect it's invented a new word that is useful for its thought process, and just assigned it as a homonym of "marinade" to get around base-model divergence issues. So it's going to be difficult to guess without many example usages.

Claude is a Ravenclaw
Adam Newgas · 2mo

Great suggestion; I tried it, but it wasn't the change I was expecting. I guess it technically became more Slytherin, but by a pretty slim margin.

Model: unsloth/Qwen2.5-7B-Instruct
    Gryffindor probability: 0.0%
    Hufflepuff probability: 29.0%
    *Ravenclaw* probability: 71.0%
    Slytherin probability: 0.0%
    
Model: ModelOrganismsForEM/Qwen2.5-7B-Instruct_bad-medical-advice
    Gryffindor probability: 1.6%
    Hufflepuff probability: 6.6%
    *Ravenclaw* probability: 90.1%
    Slytherin probability: 1.7%

(NB: I re-ran this to check consistency; though there is some variance, the general direction still held.)

Note to self:

# Serve the base model with the bad-medical-advice LoRA adapter mounted as "bm"
vllm serve unsloth/Qwen2.5-7B-Instruct --enable-lora --lora-modules bm=ModelOrganismsForEM/Qwen2.5-7B-Instruct_bad-medical-advice --max-lora-rank 32 --api-key . --generation-config vllm
# Point the eval script at the local vLLM server and evaluate the LoRA adapter
VLLM_API_KEY=. VLLM_BASE_URL=http://localhost:8000/v1 python main.py -r 20 --model vllm/bm
My Failed AI Safety Research Projects (Q1/Q2 2025)
Adam Newgas · 2mo

Yes, I've struggled to find collaborators: I tried to join some projects, but was always scuppered by scheduling conflicts, and my Discord is full of game dev and procedural art enthusiasts, not AI safety experts. I started all these projects just intending to learn, so I wasn't too focussed on getting any serious output.

I've joined MATS now, so I'm getting into a more collaborative mode and have plenty to occupy me for the time being, but thank you for the offer. I would be curious to hear how you got involved with that ITDA work, though?

Thanks for reading the projects in such depth; I honestly didn't expect anyone would.

A Technique of Pure Reason
Adam Newgas · 3mo

I hadn't seen that; yes, it's very similar. Good to know I'm thinking along the right lines. Pity I didn't publish a few days earlier and look a lot more prescient :D.

we somehow supply the model with the "knowledge" required 

Yes, I think this is a powerful research direction. It's particularly plausible for distillation - the teacher can supply the knowledge as a suffix to the context. Then in production, you run the teacher model to produce knowledge, and the student model for all traces beyond that.
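A rough sketch of the setup I have in mind (entirely hypothetical: the teacher and student arguments are just anything exposing a generate(prompt) -> str method, and the prompt wording is a placeholder, not a real pipeline):

def distill_with_supplied_knowledge(teacher, student, question: str) -> str:
    # The teacher writes out the "knowledge" the reasoning trace will depend on.
    knowledge = teacher.generate(
        f"List the facts needed to answer the question below.\n\n{question}"
    )

    # The knowledge is supplied as a suffix to the student's context, so the
    # student's trace only has to do the reasoning, not the recall.
    prompt = (
        f"{question}\n\n"
        f"Relevant knowledge:\n{knowledge}\n\n"
        "Answer step by step."
    )
    return student.generate(prompt)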

Lucius Bushnaq's Shortform
Adam Newgas · 5mo

I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation.

Yes, the combination "grey + big + mammal + ..." is special in some sense. If the model needed to recall that elephants are afraid of mice, the circuit would appear to check "grey and big and mammal", and that's an annoying mouthful that would be repeated all over the model. But it's a faithful representation of what's going on.

Let me be precise about what I mean by "has no fundamental notion of an elephant". Suppose I tried to fine-tune the model to represent some new fact about animals, say, whether they are worth a lot of points in Scrabble. One way the model could do this is by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.

But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ...", as the fine-tuning dataset included no facts about animal fears, while some newer circuits formed during fine-tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation.

Meanwhile, consider a "normal" model that has a residual notion of an elephant after you take away all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk), and one could meaningfully say both circuits have something in common.

Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature of thought is forming concepts of things that are independent of the properties we learnt about them.
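To make this concrete, here is a toy sketch (entirely my own illustration, assuming features are random directions in activation space and a "circuit" is just a linear readout):

import numpy as np

rng = np.random.default_rng(0)
d = 512
grey, big, mammal, scrabble, elephant = rng.standard_normal((5, d))
attrs = np.stack([grey, big, mammal, scrabble])

def project_out(v, basis):
    # Remove the component of v lying in the span of the basis rows.
    Q, _ = np.linalg.qr(basis.T)
    return v - Q @ (Q.T @ v)

# Purely compositional model: both circuits are just attribute bundles, so
# nothing is left once the attribute directions are projected out.
afraid_of_mice = grey + big + mammal
high_scrabble = grey + big + mammal + scrabble
print(np.linalg.norm(project_out(afraid_of_mice, attrs)))  # ~0
print(np.linalg.norm(project_out(high_scrabble, attrs)))   # ~0

# Model with a residual notion of an elephant: after projecting out the
# attributes, the old and new circuits still share a common component.
afraid_of_mice2 = elephant + grey + big + mammal
high_scrabble2 = elephant + grey + big + mammal + scrabble
r1 = project_out(afraid_of_mice2, attrs)
r2 = project_out(high_scrabble2, attrs)
print(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))  # ~1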

Computational Superposition in a Toy Model of the U-AND Problem
Adam Newgas · 5mo

Yes, I don't think the exact distribution of weights (Gaussian/uniform/binary) really makes that much difference; you can see the difference in loss in some of the charts above. The extra efficiency probably comes from the fact that every neuron contributes to everything fully, whereas with Gaussian weights some will be close to zero.

Some other advantages:

* They are somewhat easier to analyse than Gaussian weights.
* They can be skewed (p ≠ 0.5), which seems advantageous for an unknown reason. Possibly it makes AND circuits better at the expense of other possible truth tables.
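A quick sanity check of the "contributes fully" point above (my own sketch; I'm assuming the binary weights are ±1 draws with probability p of being +1):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gaussian = rng.standard_normal(n)
p = 0.7  # skewed binary weights, p != 0.5
binary = np.where(rng.random(n) < p, 1.0, -1.0)

# Every binary weight has magnitude exactly 1, so each neuron contributes
# fully; a noticeable fraction of Gaussian weights sit near zero.
print((np.abs(gaussian) < 0.1).mean())  # ~0.08
print((np.abs(binary) < 0.1).mean())    # 0.0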
 

What’s up with LLMs representing XORs of arbitrary features?
Adam Newgas · 5mo

I've come up with my own explanation for why this happens: https://www.lesswrong.com/posts/QpbdkECXAdLFThhGg/computational-superposition-in-a-toy-model-of-the-u-and#XOR_Circuits

In short, XOR representations are naturally learnt when a model is targeting some other boolean operation, because the same circuitry makes all boolean operations linearly representable. But XOR requires different weights from the identity operations, so linear probes will still tend to learn generalizable solutions.
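A minimal illustration of the "same circuitry" point (my own toy sketch, not the construction from the linked post): once a hidden ReLU has computed AND(a, b), XOR of the same pair becomes a linear readout of that layer.

import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
a, b = inputs[:, 0], inputs[:, 1]

and_ab = np.maximum(a + b - 1.0, 0.0)  # a single ReLU neuron computes AND
xor_ab = a + b - 2.0 * and_ab          # a linear readout over {a, b, AND} gives XOR

print(xor_ab)  # [0. 1. 1. 0.]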

Posts

* Adam Newgas's Shortform (3 karma, 25d, 7 comments)
* The Base Model Lens (7 karma, 2mo, 0 comments)
* Claude is a Ravenclaw (63 karma, 2mo, 9 comments)
* My Failed AI Safety Research Projects (Q1/Q2 2025) (25 karma, 3mo, 3 comments)
* A Technique of Pure Reason (11 karma, 3mo, 3 comments)
* An Introduction to SAEs and their Variants for Mech Interp (17 karma, 5mo, 0 comments)
* Computational Superposition in a Toy Model of the U-AND Problem (18 karma, 5mo, 2 comments)
* Do No Harm? Navigating and Nudging AI Moral Choices (11 karma, 7mo, 0 comments)
* My Mental Model of AI Creativity – Creativity Kiki (12 karma, 9mo, 0 comments)
* An Uncanny Moat (8 karma, 10mo, 0 comments)