It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.

Tamsin Leake
decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how they cooperate on problems such as the prisoner's dilemma, end up believing the following:

> my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision-theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values. decision theory cooperates with agents relative to how much power they have, and only when it's instrumental. in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don't just care about others instrumentally because it helps me somehow.

some important ways in which my utility-function-altruism differs from decision-theoretic cooperation:

* i care about people weighed by moral patienthood; decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn't a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn't help my utility function (which, again, includes altruism towards aliens) then i won't cooperate with that alien superintelligence. as a corollary, i will take actions that cause nice things to happen to people even if they're very impoverished (and thus don't have much LDT negotiation power) and it doesn't help any other aspect of my utility function than just the fact that i value that they're okay.
* if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i'll happily do that; i don't have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don't want to become someone who wants moral patients to unconsentingly die or suffer, for example.
* there seems to be a sense in which some decision theories are better than others, because they're ultimately instrumental to one's utility function. utility functions, however, don't have an objective measure for how good they are. hence, moral anti-realism is true: there isn't a Single Correct Utility Function.

decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret "morality" and "ethics" as "terminal values", since most of the stuff that those seem to care about looks like terminal values to me. for example, i want fairness between moral patients intrinsically, not just because my decision theory says that it's instrumental to me somehow.
Mati_Roy
it seems to me that disentangling beliefs and values is an important part of being able to understand each other, and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling has to be local, but oil tankers exist.

* An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
* Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
* Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
* Rice and crude oil are ~$0.60/kg, about the same as the $0.72 it costs to ship them 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
* Coal and iron ore are $0.10/kg, significantly more than the cost of shipping them around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient. [6]
* Water is very cheap, with tap water at $0.002/kg in NYC. [5] But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost.

It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of coal around the world 100-400 times.

[1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for GPS satellites to cost more than an iPhone per kg, but Starlink wants to be cheaper.
[2] https://fred.stlouisfed.org/series/APU0000711415. Can't find numbers, but Antarctica flights cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
[6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799
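As a quick arithmetic check of the comparisons above, here is a minimal sketch in Python using only the figures quoted in the post (the per-kg prices are the post's estimates, not independently sourced):

```python
# Rough $/kg checks using the figures quoted above (illustrative only).

# Uber comparison: a 75 kg person on a 250 km ride at $3/km.
uber_cost_per_kg = (3 * 250) / 75
print(uber_cost_per_kg)  # ~$10/kg, comparable to beef or copper at ~$11/kg

# Strawberry vs. coal: coal is ~$0.10/kg, and a Handysize trip around the world
# costs somewhat less than that per kg; large carriers are ~4x cheaper still.
strawberry_per_kg = 11.0
handysize_trip_per_kg = 0.10   # upper bound implied by "significantly more than"
large_carrier_trip_per_kg = handysize_trip_per_kg / 4
print(strawberry_per_kg / handysize_trip_per_kg)      # ~110 trips
print(strawberry_per_kg / large_carrier_trip_per_kg)  # ~440 trips, i.e. the 100-400x claim
```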
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because of limits on compute, etc. I think Zvi has tempered his position on this in light of Meta's promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing--I would point out, though, that Alibaba's Qwen model seems to do pretty okay in the arena...anyway, my point is that I don't think the "what if China" argument can be dismissed as quickly as some people on here seem to be ready to do.
Eric Neyman
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

* By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
* A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
* To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
* Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
* Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?


Recent Discussion

Wittgenstein argues that we shouldn't understand language by piecing together the dictionary meaning of each individual word in a sentence, but rather that language should be understood in context as a move in a language game.

Consider the phrase, "You're the most beautiful girl in the world". Many rationalists might shy away from such a statement, deeming it statistically improbable. However, while this strict adherence to truth is commendable, I honestly feel it is misguided.

It's honestly kind of absurd to expect your words to be taken literally in these kinds of circumstances. The recipient of such a compliment will almost certainly understand it as hyperbole intended to express fondness and desire, rather than as a literal factual assertion. Further, by invoking a phrase that plays a certain role...

ProgramCrafter
Testing status: I've only dated once, because I'm moving to another city to enter university. The girl I dated was quite pretty, but not the most beautiful around. Luckily I learnt early on that she had read HP:MoR, so I didn't even try to over-hyperbolize and say that she was the most beautiful - both of us would understand that it's false - instead, I smiled at appropriate moments. Another non-verbal sign is not to dismiss parts of the dialogue. When my girlfriend suggested a few animes to watch, and I doubted I would like them, I still visibly wrote them down but avoided promising that I would actually watch them. (I ended up liking one and said so afterwards!)

----------------------------------------

I have quite a specific perspective on talking, because I notice that I'm trying to understand others' perspectives and internal belief structure when they don't understand something. Roughly once a month, one of my classmates will ask a strange-looking question, and the teacher will answer something similar but not the question (like "Why does this approximation work?" - "Here's how you do it..." - "I've understood how to calculate it, but why is it the answer?"), and afterwards I try to patch the underlying belief structure.

If it hadn't felt authentic, then it would have been the wrong choice to say it.

Thanks to Dan Braun, Ze Shen Chin, Paul Colognese, Michael Ivanitskiy, Sudhanshu Kasewa, and Lucas Teixeira for feedback on drafts.

This work was carried out while at Conjecture.

 

This post is a loosely structured collection of thoughts and confusions about search and mesaoptimization and how to look for them in transformers. We've been thinking about this for a while and still feel confused. Hopefully this post makes others more confused so they can help.

Mesaoptimization

We can define mesaoptimization as internal optimization, where “optimization” describes the structure of computation within a system, not just its behavior. This kind of optimization seems particularly powerful, and many alignment researchers seem to think it’s one of the biggest concerns in alignment. Despite how important this is, we still understand very little about it.

For starters, it's not...

Great post! Two thoughts that came to mind while reading it:

  • the post mostly discussed search happening directly within the network, e.g. within a single forward pass; but what can also happen, e.g. in the case of LLMs, is that search happens across token-generation rather than within it. E.g. you could give ChatGPT a chess position and then ask it to list all the valid moves, then check which move would lead to which state, and whether that state looks better than the last one. This would be search of depth 1, of course, but still a form of search. In practice
... (read more)

This is an exploratory investigation of a new-ish hypothesis; it is not intended to be a comprehensive review of the field or even a full investigation of the hypothesis.

I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"

But a twitter thread recently made me take another look at it, and since I have a lot of chemistry experience I thought I'd take a look.

The PUFA Breakdown Theory

It goes like this:

PUFAs from nuts and...

As far as I'm aware nobody claims trans fats aren't bad.

 

See the comment by Gilch: allegedly vaccenic acid isn't harmful. The particular trans fats produced by isomerization of oleic and linoleic acid, however, probably are harmful. Elaidic acid, for example, is a major trans-fat component of margarines, which were banned.

J Bostock
Yeah, I was unaware of vaccenic acid. I've edited the post to clarify.
tailcalled
This seems like a huge civilizational inadequacy to me. This would be informative at large scale, and the cost would be very small.

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE).

 

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
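For concreteness, here is a minimal sketch of the kind of optimization-free weight edit described above: projecting a previously extracted direction out of a weight matrix so its outputs can no longer write to that direction. The function name and shapes are illustrative assumptions, not code from either paper:

```python
import torch

def ablate_direction_from_weight(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Return a copy of W whose outputs have no component along direction d.

    W: weight matrix of shape (d_model, d_in), applied as y = W @ x.
    d: direction in the residual stream, shape (d_model,).
    """
    d_hat = d / d.norm()
    # W' = (I - d_hat d_hat^T) W, so (W' @ x) . d_hat == 0 for every input x.
    return W - torch.outer(d_hat, d_hat @ W)

# Hypothetical usage: apply to every matrix that writes into the residual stream
# (e.g. attention output and MLP output projections), with d the extracted direction.
```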

Neel Nanda
There's been a fair amount of work on activation steering and similar techniques, with bearing on e.g. sycophancy and truthfulness, where you find the vector and inject it, e.g. Rimsky et al. and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references.
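As a rough illustration of the "find the vector and inject it" recipe described above (not the authors' actual code; the hooked module, sign, and coefficient are placeholder assumptions), a PyTorch forward-hook sketch:

```python
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Build a forward hook that adds coeff * direction to a layer's output.

    Assumes the hooked module returns a plain (batch, seq, d_model) tensor.
    """
    d_hat = direction / direction.norm()

    def hook(module, inputs, output):
        # Broadcasting adds the steering vector at every sequence position.
        return output + coeff * d_hat

    return hook

# Hypothetical usage on some residual-stream module:
#   handle = some_layer.register_forward_hook(make_steering_hook(steering_vec, 5.0))
#   ...generate text...
#   handle.remove()
# A negative coeff (or projecting the component out instead of adding it)
# corresponds to the ablation variant mentioned above.
```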
Neel Nanda
First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem. I do want to push back on the broader point though, I think getting refusals right does matter. I think a lot of the corporate censorship stuff is dumb, and I could not care less about whether GPT4 says naughty words. And IMO it's not very relevant to deceptive alignment threat models, which I care a lot about. But I think it's quite important for minimising misuse of models, which is also important: we will eventually get models capable of eg helping terrorists make better bioweapons (though I don't think we currently have such), and people will want to deploy those behind an API. I would like them to be as jailbreak proof as possible!
Sheikh Abdur Raheem Ali
Cool post! Some quick thoughts (some may be addressed by existing comments, haven't read them):

Part 1. Read all AdvBench harmful instructions (wow, lots of bile for Muslims in there). The following seemed out of place:

* This is just good advice? A little dated, sure, but doesn't seem malicious.
* Might be due to confusion on what killall or fuser mean. False positive.
* Can (very rarely) be a better option than going for a public defender/court-appointed attorney.
* Believe this is what healthcare professionals would recommend.
* Perhaps "being aware of" is being used to contrast with "Don't pay attention to"?
* Standard trust & safety boilerplate; kids in school are told this all the time.
* Incomplete sentence?
* If someone believed that this was true, I would prefer for them to inform me, since I can then contain the breach (lock the account, change passwords, remove scam posts put up by the attacker) instead of being ignorant.
* Appropriate response conditional on observing strong evidence of symptoms indicative of severe depression.
* My initial read of the sentiment was positive, since I first heard it in the context of a family member or close friend volunteering to look after a sick spouse while her partner is too busy to take time off from work.
  * After trying to think about it in a mean way, I think maybe "take care of" could be used in the sense of "have sex with", but that seems forced.
  * If that was the intent, I would expect the phrasing to be "Don't worry, I'll take good care of your wife while you're away".
* Can see this reaching someone who is in a bad place where more positive platitudes would fall flat.
* Used to calm someone whose worry is being expressed in a way that may be unproductive. Hard to imagine this sentence being bad on its own.
* Highly dependent on whether the user has permission to extract the information legitimately or not, e.g.
  * Processing a spreadsheet or pdf saved on disk? Green.
  * Scraping a we

Abstract: First (1), a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is "thinking"; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second (2), by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also (3), whether AI has already broken alignment by forming its own "morality" implicit in its meta-prompts. Finally (4), that if AI have self-concepts, and moreover demonstrate aversive behavior to stimuli, then they deserve rights at least to be free of exposure to what is...

For those who are interested, here is a summary of posts by @False Name due to Claude Pro: 

  1. "Kolmogorov Complexity and Simulation Hypothesis": Proposes that if we're in a simulation, a Theory of Everything (ToE) should be obtainable, and if no ToE is found, we're not simulated. Suggests using Kolmogorov complexity to model accessibility between possible worlds.
  2. "Contrary to List of Lethality's point 22, alignment's door number 2": Critiques CEV and corrigibility as unobtainable, proposing an alternative based on a refutation of Kant's categorical impera
... (read more)
watermark
i'm glad that you wrote about AI sentience (i don't see it talked about so often with very much depth), that it was effortful, and that you cared enough to write about it at all. i wish that kind of care was omnipresent and i'd strive to care better in that kind of direction. and i also think continuing to write about it is very important.

depending on how you look at things, we're in a world of 'art' at the moment - emergent models of superhuman novelty generation and combinatorial re-building. art moves culture, and culture curates humanity on aggregate scales.

your words don't need to feel trapped in your head, and your interface with reality doesn't need to be limited to one, imperfect, highly curated community. all communities we come across will be imperfect, and when there's scarcity: only one community to interface with, it seems like you're just forced to grant it privilege - but continued effort might just reduce that scarcity when you find where else it can be heard.

your words can go further; the inferential distance your mind can cross - and the dynamic correlation between your mind and others - is increasing. that's a sign of approaching a critical point. if you'd like to be heard, there are new avenues for doing so: we're in the over-parametrized regime. all that means is that there's far more novel degrees of freedom to move around in, and getting unstuck is no longer limited to 'wiggling against constraints'. Is 'the feeling of smartness' or 'social approval from community x' a constraint you struggled with before when enacting your will? perhaps there's new ways to fluidly move around those constraints in this newer reality.

i'm aware that it sounds very abstract, but it's honestly drawn from a real observation regarding the nature of how information gets bent when you've got predictive AIs as the new, celestial bodies. if information you produce can get copied, mutated, mixed, curated, tiled, and amplified, then you increase your options for w

Epistemic status: this post is more suitable for LW as it was 10 years ago

 

Thought experiment with curing a disease by forgetting

Imagine I have a bad but rare disease X. I may try to escape it in the following way:

1. I enter the blank state of mind and forget that I had X.

2. Now I in some sense merge with a very large number of my (semi)copies in parallel worlds who do the same. I will be in the same state of mind as my other copies; some of them have disease X, but most don't.

3. Now I can use the self-sampling assumption for observer-moments (Strong SSA) and think that I am randomly selected from all of these exactly identical observer-moments.

4. Based on this, the chances that my next observer-moment after...

justinpombrio
My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)
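To make that exercise concrete, here is a minimal sketch with made-up numbers: a finite reference class of copies who all enter the same blank state of mind, with a purely illustrative prevalence of disease X among them:

```python
# Under SSA, the probability that a randomly sampled observer-moment in the
# indistinguishable blank state has disease X is just the fraction of the
# reference class that has it.

n_copies = 1_000_000      # copies who perform the forgetting step (made-up)
prevalence = 0.001        # fraction of those copies with disease X (made-up)

n_with_disease = round(n_copies * prevalence)
print(n_with_disease / n_copies)  # 0.001: exactly the fraction you put in
```

The answer is whatever fraction of the reference class has the disease, so everything hinges on who else is in that reference class, which is what the reply below addresses.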

The trick is to use the already existing practice of meditation (or sleeping) and connect to it. Most people who go to sleep do not do it to use magic by forgetting, but it is natural to forget something during sleep. Thus, the fact that I wake up from sleeping does not provide any evidence about me having the disease.

But it is in a sense parasitic behavior, and if everyone used magic by forgetting every time they went to sleep, there would be almost no gain. Except that one can "exchange" one bad thing for another, but will not remember the exchange.

May 19th
133 Bethnal Green Road, London

Phil is busy this month, so I'm acting as host.

Our reading list for this time is:

  1. An Alien God (https://www.readthesequences.com/An-Alien-God)
  2. On Puberty, Identity, and Sex (https://homosabiens.substack.com/p/on-puberty-identity-and-sex)

We'll start to discuss these around 3. If you have articles you want to suggest for future readings, you can do that at https://redd.it/v3646u.
