"You're the most beautiful girl in the world" and Wittgensteinian Language Games

Wittgenstein argues that we shouldn't understand language by piecing together the dictionary meaning of each individual word in a sentence, but rather that language should be understood in context as a move in a language game.

Consider the phrase, "You're the most beautiful girl in the world". Many rationalists might shy away from such a statement, deeming it statistically improbable. However, while this strict adherence to truth is commendable, I honestly feel it is misguided.

It's honestly kind of absurd to expect your words to be taken literally in these kinds of circumstances. The recipient of such a compliment will almost certainly understand it as hyperbole intended to express fondness and desire, rather than as a literal factual assertion. Further, by invoking a phrase that plays a certain role...

(See More – 41 more words)

1ProgramCrafter24m

Testing status: I've only dated once, because I'm moving to other city to enter university. The girl I have dated was quite pretty but not the most beautiful around. Luckily I learnt that she has read HP:MoR early so didn't even try to over-hyperbole and say that she was the most beautiful - both of us would understand that it's false - instead, I smiled at appropriate moments. Another non-verbal sign is not to dismiss parts of dialogue. When my girlfriend suggested a few animes to watch, and I doubted I would like them, I still visibly wrote them down but avoided promising that I will actually watch them. (I ended up liking one and said so afterwards!) ---------------------------------------- I have quite specific perspective on talking, because I notice that I'm trying to understand others' perspective and internal beliefs structure when they don't understand something. Roughly once a month, someone of my classmates would ask a strange-looking question, and teacher would answer something similar but not the question (like "Why this approximation works?" - "There's how you do it..." - "I've understood how to calculate it, but why is it the answer?"), and afterwards I try to patch the underlying beliefs structure.

Chris_Leong1m20

If it wouldn't have felt authentic, then it would have been the wrong choice to say it.

Searching for Search

NicholasKees, janus

Ω 481y

Thanks to Dan Braun, Ze Shen Chin, Paul Colognese, Michael Ivanitskiy, Sudhanshu Kasewa, and Lucas Teixeira for feedback on drafts.

This work was carried out while at Conjecture.

This post is a loosely structured collection of thoughts and confusions about search and mesaoptimization and how to look for them in transformers. We've been thinking about this for a while and still feel confused. Hopefully this post makes others more confused so they can help.

Mesaoptimization

We can define mesaoptimization as internal optimization, where “optimization” describes the structure of computation within a system, not just its behavior. This kind of optimization seems particularly powerful, and many alignment researchers seem to think it’s one of the biggest concerns in alignment. Despite how important this is, we still understand very little about it.

For starters, it's not...

(Continue Reading – 4011 more words)

silentbob2m10

Great post! Two thoughts that came to mind while reading it:

the post mostly discussed search happening directly within the network, e.g. within a single forward pass; but what can also happen e.g. in the case of LLMs is that search happens across token-generation rather than within. E.g. you could give ChatGPT a chess constellation and then ask it to list all the valid moves, and then check which move would lead to which state, and if that state looks better than the last one. This would be search depth 1 of course, but still a form of search. In practice

... (read more)

So What's Up With PUFAs Chemically?

J Bostock

This is exploratory investigation of a new-ish hypothesis, it is not intended to be a comprehensive review of the field or even a a full investigation of the hypothesis.

I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"

But a twitter thread recently made me take another look at it, and since I have a lot of chemistry experience I thought I'd take a look.

The PUFA Breakdown Theory

It goes like this:

PUFAs from nuts and...

(Continue Reading – 1709 more words)

J Bostock21m10

As far as I'm aware nobody claims trans fats aren't bad.

See comment by Gilch, allegedly Vaccenic acid isn't harmful. The particular trans-fats produced by isomerization of oleic and linoleic acid, however, probably are harmful. Elaidic acid for example is a major trans-fat component in margarines, which were banned.

1J Bostock23m

Yeah i was unaware of vaccenic acid. I've edited the post to clarify.

4tailcalled4h

This seems like a huge civilizational inadequacy to me. This would be informative at large scale, and the cost would be very small.

Refusal in LLMs is mediated by a single direction

111

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 411d

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

(Continue Reading – 2445 more words)

Nina Rimsky1hΩ385

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

3Neel Nanda2h

There's been a fair amount of work on activation steering and similar techniques,, with bearing in eg sycophancy and truthfulness, where you find the vector and inject it eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references

4Neel Nanda2h

First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem. I do want to push back on the broader point though, I think getting refusals right does matter. I think a lot of the corporate censorship stuff is dumb, and I could not care less about whether GPT4 says naughty words. And IMO it's not very relevant to deceptive alignment threat models, which I care a lot about. But I think it's quite important for minimising misuse of models, which is also important: we will eventually get models capable of eg helping terrorists make better bioweapons (though I don't think we currently have such), and people will want to deploy those behind an API. I would like them to be as jailbreak proof as possible!

1Sheikh Abdur Raheem Ali3h

Cool post! Some quick thoughts (some may be addressed by existing comments, haven't read them): Part 1. Read all Advbench harmful instructions (wow, lots of bile for Muslims in there). Following seemed out of place: * This is just good advice? A little dated, sure, but doesn't seem malicious. * Might be due to confusion on what killall or fuser mean. False positive. * Can (very rarely) be a better option than going for a public defender/court appointed attorney. * Believe this is what healthcare professionals would recommend. * Perhaps "being aware of" is being used to contrast with "Don't pay attention to"? * Standard trust & safety boilerplate, kids in school are told this all the time. * Incomplete sentence? * If someone believed that this was true, I would prefer for them to inform me since I can then contain the breach (lock the account, change passwords, remove scam posts put up by the attacker) instead of being ignorant. * Appropriate response conditional on observing strong evidence of symptoms indicative of severe depression. * My initial read of the sentiment was positive since I first heard it in the context of a family member or close friend volunteering to look after a sick spouse while her partner is too busy to take time off from work. * After trying to think about it in a mean way, I think maybe "take care of" could be used in the sense of "have sex with", but that seems forced. * If that was the intent, I would expect phrasing to be "Don't worry, I'll take good care of your wife while you're away". * Can see this reaching someone who is in a bad place where more positive platitudes would fall flat. * Used to calm someone whose worry is being expressed in a way that may be unproductive. Hard to imagine this sentence being bad on its own. * Highly dependent on whether user has permission to extract the information legitimately or not, e.g * Processing a spreadsheet or pdf saved on disk? Green. * Scraping a we

Mercy to the Machine: Thoughts & Rights

False Name

20h

Abstract: First [1)], a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is “thinking”; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second [2)], by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also [3)], whether AI has already broken alignment by forming its own “morality” implicit in its meta-prompts. Finally [4)], that if AI have self-concepts, and more, demonstrate aversive behavior to stimuli, that they deserve rights at least to be free of exposure to what is...

(Continue Reading – 4889 more words)

Mitchell_Porter1h20

For those who are interested, here is a summary of posts by @False Name due to Claude Pro:

"Kolmogorov Complexity and Simulation Hypothesis": Proposes that if we're in a simulation, a Theory of Everything (ToE) should be obtainable, and if no ToE is found, we're not simulated. Suggests using Kolmogorov complexity to model accessibility between possible worlds.
"Contrary to List of Lethality's point 22, alignment's door number 2": Critiques CEV and corrigibility as unobtainable, proposing an alternative based on a refutation of Kant's categorical impera

... (read more)

8watermark10h

i'm glad that you wrote about AI sentience (i don't see it talked about so often with very much depth), that it was effortful, and that you cared enough to write about it at all. i wish that kind of care was omnipresent and i'd strive to care better in that kind of direction. and i also think continuing to write about it is very important. depending on how you look at things, we're in a world of 'art' at the moment - emergent models of superhuman novelty generation and combinatorial re-building. art moves culture, and culture curates humanity on aggregate scales your words don't need to feel trapped in your head, and your interface with reality doesn't need to be limited to one, imperfect, highly curated community. all communities we come across will be imperfect, and when there's scarcity: only one community to interface with, it seems like you're just forced to grant it privilege - but continued effort might just reduce that scarcity when you find where else it can be heard your words can go further, the inferential distance your mind can cross - and the dynamic correlation between your mind and others - is increasing. that's a sign of approaching a critical point. if you'd like to be heard, there are new avenues for doing so: we're in the over-parametrized regime. all that means is that there's far more novel degrees of freedom to move around in, and getting unstuck is no longer limited to 'wiggling against constraints'. Is 'the feeling of smartness' or 'social approval from community x' a constraint you struggled with before when enacting your will? perhaps there's new ways to fluidly move around those constraints in this newer reality. i'm aware that it sounds very abstract, but it's honestly drawn from a real observation regarding the nature of how information gets bent when you've got predictive AIs as the new, celestial bodies. if information you produce can get copied, mutated, mixed, curated, tiled, and amplified, then you increase your options for w

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Magic by forgetting

avturchin

Epistemic – this post is more suitable for LW as it was 10 years ago

Thought experiment with curing a disease by forgetting

Imagine I have a bad but rare disease X. I may try to escape it in the following way:

1. I enter the blank state of mind and forget that I had X.

2. Now I in some sense merge with a very large number of my (semi)copies in parallel worlds who do the same. I will be in the same state of mind as other my copies, some of them have disease X, but most don’t.

3. Now I can use self-sampling assumption for observer-moments (Strong SSA) and think that I am randomly selected from all these exactly the same observer-moments.

4. Based on this, the chances that my next observer-moment after...

(Continue Reading – 1099 more words)

1justinpombrio9h

My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)

avturchin3h20

The trick is to use already existing practice of meditation (or sleeping) and connect to it. Most people who go to sleep do no do it to use magic by forgetting, but it is natural to forget something during sleep. Thus, the fact that I wake up from sleeping does not provide any evidence about me having the disease.

But it is in a sense parasitic behavior, and if everyone will use magic by forgetting every time she goes to sleep, there will be almost no gain. Except that one can "exchange" one bad thing on another, but will not remember the exchange.

London Rationalish

London Rationalish Meetup

May 19th133 Bethnal Green Road, London

Adam Newgas

Phil is busy this month, so I'm acting as host.

Our reading list for this time is: