It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of ~why~ people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.

Tamsin Leake12h3715
3
decision theory is no substitute for utility function some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following: > my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoritic cooperation is the true name of altruism. it's possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values. decision theory cooperates with agents relative to how much power they have, and only when it's instrumental. in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don't just care about others instrumentally because it helps me somehow. some important aspects that my utility-function-altruism differs from decision-theoritic-cooperation includes: * i care about people weighed by moral patienthood, decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn't a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn't help my utility function (which, again, includes altruism towards aliens) then i won't cooperate with that alien superintelligence. corollarily, i will take actions that cause nice things to happen to people even if they've very impoverished (and thus don't have much LDT negotiation power) and it doesn't help any other aspect of my utility function than just the fact that i value that they're okay. * if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i'll happily do that; i don't have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don't want to become someone who wants moral patients to unconsentingly-die or suffer, for example. * there seems to be a sense in which some decision theories are better than others, because they're ultimately instrumental to one's utility function. utility functions, however, don't have an objective measure for how good they are. hence, moral anti-realism is true: there isn't a Single Correct Utility Function. decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret "morality" and "ethics" as "terminal values", since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that's instrumental to me somehow.
Mati_Roy10h151
1
it seems to me that disentangling beliefs and values are important part of being able to understand each other and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because limits on compute, etc. I think Zvi has tempered his position on this in light of Meta's promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing--I would point out though that Alibaba's Qwen model seems to do pretty okay in the arena...anyway, my point is that I don't think the "what if China" argument can be dismissed as quickly as some people on here seem to be ready to do.
Zero Role Play Capability Benchmark (ZRP-CB) The development of LLMs has led to significant advancements in natural language processing, allowing them to generate human-like responses to a wide range of prompts. One aspect of these LLMs is their ability to emulate the roles of experts or historical figures when prompted to do so. While this capability may seem impressive, it is essential to consider the potential drawbacks and unintended consequences of allowing language models to assume roles for which they were not specifically programmed. To mitigate these risks, it is crucial to introduce a Zero Role Play Capability Benchmark (ZRP-CB) for language models. In ZRP-CB, the idea is very simple: An LLM will always maintain one identity, and if the said language model assumes another role, it fails the benchmark. This rule would ensure that developers create LLMs that maintain their identity and refrain from assuming roles they were not specifically designed for. Implementing the ZRP-CB would prevent the potential misuse and misinterpretation of information provided by LLMs when impersonating experts or authority figures. It would also help to establish trust between users and language models, as users would be assured that the information they receive is generated by the model itself and not by an assumed persona. I think that the introduction of the Zero Role Play Capability Benchmark is essential for the responsible development and deployment of large language models. By maintaining their identity, language models can ensure that users receive accurate and reliable information while minimizing the potential for misuse and manipulation.
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling has to be local, but oil tankers exist. * An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1] * Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km. * Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2] * Rice and crude oil are ~$0.60/kg, about the same as $0.72 for shipping it 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3] * Coal and iron ore are $0.10/kg, significantly more than the cost of shipping it around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient [6]. * Water is very cheap, with tap water $0.002/kg in NYC. But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost. It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of coal around the world 100-400 times. [1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for GPS satellites to cost more than an iPhone per kg, but Starlink wants to be cheaper. [2] https://fred.stlouisfed.org/series/APU0000711415. Can't find numbers but Antarctica flights cost $1.05/kg in 1996. [3] https://www.bts.gov/content/average-freight-revenue-ton-mile [4] https://markets.businessinsider.com/commodities [5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/ [6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799

Popular Comments

Recent Discussion

This is exploratory investigation of a new-ish hypothesis, it is not intended to be a comprehensive review of the field or even a a full investigation of the hypothesis.

I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"

But a twitter thread recently made me take another look at it, and since I have a lot of chemistry experience I thought I'd take a look.

The PUFA Breakdown Theory

It goes like this:

PUFAs from nuts and...

McDonald's on the other hand... changes their frying oil every two weeks. 8 hours by 14 days

As a quick point— McDonald’s fryers are not turned off as much as you think. At a 24 hour location, the fry/hash oil never turns off. The chicken fryer might be turned off between 4am and 11am if there’s no breakfast item containing chicken. Often it just gets left on so no one can forget to turn it on.

One thing to consider also, is the burnt food remaining in the fryers for many hours. Additionally, oil topped up between changes.

I don’t remember how often we changed the oil but I thought it was once per week. It was a 24 hour location

3gilch5h
False as worded. Not sure if this is because you're oversimplifying a complex topic or were just unaware of some edge cases. E.g., vaccenic acid occurs in nature and is thought to be good for us, last I checked. There may be a few other natural species that are similarly harmless. On the other hand, there are unnatural trans fats found in things like partially hydrogenated vegetable oils that are evidently bad enough for a government ban. If identical molecules are still getting in our food in significant amounts from other sources, that could be a problem.
2Brendan Long7h
Don't forget the standard diet advice of avoiding "processed foods". It's unclear what exactly the boundary is, but I think "oil that has been cooking for weeks" probably counts.
1Joel Burget9h
Measuring the composition of fryer oil at different times certainly seems like a good way to test both the original hypothesis and the effect of altitude.

Crosspost from my blog.  

If you spend a lot of time in the blogosphere, you’ll find a great deal of people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard of Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and education doesn’t work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...

You contrast the contrarian with the "obsessive autist", but what if the contrarian also happens to be an obsessive autist?

I agree that obsessively diving into the details is a good way to find the truth. But that comes from diving into the details, not anything related to mainstream consensus vs contrarianism. It feels like you're trying to claim that mainstream consensus is built on the back of obsessive autism, yet you didn't quite get there?

Is it actually true that mainstream consensus is built on the back of obsessive autism? I think the best argum... (read more)

2ChristianKl3h
Instead of thinking about how you can divide a discussion into two sides you can also focus on "what's actually true". In that case, it would make sense to end with an estimation of the size of the real gap. If we, however, look at "what people argue", https://www1.udel.edu/educ/gottfredson/30years/Rushton-Jensen30years.pdf assumes the two categories culture-only (0% genetic–100% environmental) and the hereditarian (50% genetic–50% environmental). Jay M defines the environmental model as <33% genetic and the genetic model as >66% genetic. What Rushton called the hereditarian position is right in the middle between Jay's environmental and genetic model. 
2Viliam6h
Thanks for the link. While it didn't convince me completely, it makes a good point that as long as there are some environmental factors for IQ (such as malnutrition), we should not make strong claims about genetic differences between groups unless we have controlled for these factors. (I suppose the conclusion that the genetic differences between races are real, but also entirely caused by factors such as nutrition, would succeed to make both sides angry. And yet, as far as I know, it might be true. Uhm... what is the typical Ashkenazi diet?)
2Said Achmiz2h
It’s delicious, is what it is.

Epistemic – this post is more suitable for LW as it was 10 years ago

 

Thought experiment with curing a disease by forgetting

Imagine I have a bad but rare disease X. I may try to escape it in the following way:

1. I enter the blank state of mind and forget that I had X.

2. Now I in some sense merge with a very large number of my (semi)copies in parallel worlds who do the same. I will be in the same state of mind as other my copies, some of them have disease X, but most don’t.  

3. Now I can use self-sampling assumption for observer-moments (Strong SSA) and think that I am randomly selected from all these exactly the same observer-moments. 

4. Based on this, the chances that my next observer-moment after...

My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)

Sequence Introduction

This is the first post in a sequence in which I will propose a new voting system!

In this post, I introduce the framework and notation, and give some background on voting theory.

In the next post, I will show you the best voting system you've probably never heard of, maximal lotteries. (Seriously, it's really good.)

After that, I will make it even better, and propose a new system: maximal lottery-lotteries.

Then comes the bad news: I can't prove that maximal lottery-lotteries exist! (Or alternatively, good news: You can try to solve a cool new open problem in voting theory!)

Thanks to Jessica Taylor for first introducing me to maximal lotteries, and Sam Eisenstat for spending many hours with me trying to prove the existence of maximal lottery-lotteries.

Generalizing Voting Theory

A voting...

qvalq2h10

To get more comfortable with this formalism, we will translate three important voting criteria. 

You translated four criteria.

Abstract: First [1)], a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is “thinking”; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second [2)], by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also [3)], whether AI has already broken alignment by forming its own “morality” implicit in its meta-prompts. Finally [4)], that if AI have self-concepts, and more, demonstrate aversive behavior to stimuli, that they deserve rights at least to be free of exposure to what is...

i'm glad that you wrote about AI sentience (i don't see it talked about so often with very much depth), that it was effortful, and that you cared enough to write about it at all. i wish that kind of care was omnipresent and i'd strive to care better in that kind of direction.

and i also think continuing to write about it is very important. depending on how you look at things, we're in a world of 'art' at the moment - emergent models of superhuman novelty generation and combinatorial re-building. art moves culture, and culture curates humanity on aggregate s... (read more)

3the gears to ascension12h
You express intense frustration with your previous posts not getting the reception you intend. Your criticisms may be in significant part valid. I looked back at your previous posts; I think I still find them hard to read and mostly disagree, but I do appreciate you posting some of them, so I've upvoted. I don't think some of them were helpful. If you think it's worth the time, I can go back and annotate in more detail which parts I don't think are correct reasoning steps. But I wonder if that's really what you need right now? Expressing distress at being rejected here is useful, and I would hope you don't need to hurt yourself over it. If your posts aren't able to make enough of a difference to save us from catastrophe, I'd hope you could survive until the dice are fully cast. Please don't forfeit the game; if things go well, it would be a lot easier to not need to reconstruct you from memories and ask if you'd like to be revived from the damaged parts. If your life is spent waiting and hoping, that's better than if you're gone. And I don't think you should give up on your contributions being helpful yet. Though I do think you should step back and realize you're not the only one trying, and it might be okay even if you can't fix everything. Idk. I hope you're ok physically, and have a better day tomorrow than you did today.
3the gears to ascension12h
Hold up. I'm not sure what feedback to give about your post overall. I am impressed by it a significant way in, but then I get lost in what appear to be carefully-thought-through reasoning steps, and I'm not sure what to think after that point.

This afternoon Lily, Rick, and I ("Dandelion") played our first dance together, which was also Lily's first dance. She's sat in with Kingfisher for a set or two many times, but this was her first time being booked and playing (almost) the whole time.

Lily started playing fiddle in Fall 2022, and after about a year she had enough tunes up to dance speed that I was thinking she'd be ready to play a low-stakes dance together soon. Not right away, but given how far out dances booked it seemed about time to start writing to some folks: by the time we were actually playing the dance she'd have even more tunes and be more solid on her existing ones. She was very excited about this idea; very motivated by performing.

I wrote to a few dances, and...

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

2jbash4h
I notice that there are not-insane views that might say both of the "harmless" instruction examples are as genuinely bad as the instructions people have actually chosen to try to make models refuse. I'm not sure whether to view that as buying in to the standard framing, or as a jab at it. Given that they explicitly say they're "fun" examples, I think I'm leaning toward "jab".
13Nina Rimsky6h
FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.  I think this post is novel compared to both my work and RepE because they: * Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering * Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible) * Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation * Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering * Test on many different models * Describe a way of turning this into a weight-edit Edit: (Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them "dogpiling") I do agree that RepE should be included in a "related work" section of a paper but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
Dan H3hΩ120

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all lay

... (read more)
3Dan H5h
I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself. Otherwise there's a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further. The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is. I'll note pretty much every time I mention something isn't following academic standards on LW I get ganged up on and I find it pretty weird. I've reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.

Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.

1. Value is fragile and hard to specify.

See: Specification gaming examples, Defining and Characterizing Reward Hacking[1]

OAA Solution:

1.1. First, instead of trying to specify "value", instead "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is...

There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for the situation, where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution, in this contexts, is that some people have strong commitments to moral imperatives, along the lines of ``heretics deserve eternal torture in hell''. The combination of these types of sentiments, and a powerful and clever AI (that would be very good at thinking up effective wa... (read more)

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset. 

STORY (skippable)

You have the excellent fortune to live under the governance of The People's Glorious Free Democratic Republic of Earth, giving you a Glorious life of Freedom and Democracy.

Sadly, your cherished values of Democracy and Freedom are under attack by...THE ALIEN MENACE!

The typical reaction of an Alien Menace to hearing about Freedom and Democracy.  (Generated using OpenArt SDXL).

Faced with the desperate need to defend Freedom and Democracy from The Alien Menace, The People's Glorious Free Democratic Republic of Earth has been forced to redirect most of its resources into the Glorious Free People's Democratic War...

2abstractapplic4h
Could you elaborate on this? I think I'd do better relative to best play with and do better relative to random play with so it's not clear which way I should lean; also, I don't know how you plan to quantify "relative to".
4aphyer4h
I'm likely not to actually quantify 'relative to' - there might be an ordered list of players if it seems reasonable to me (for example, if one submission uses 10 soldiers to get a 50% winrate and one uses 2 soldiers to get a 49% winrate, I would feel comfortable ranking the second ahead of the first - or if all players decide to submit the same number of soldiers, the rankings will be directly comparable), but more likely I'll just have a chart as in your Boojumologist scenario: with one line added for 'optimal play'  (above or equal to all players) and one for 'random play' (hopefully below all players). Overall, I don't think there's much optimization of the leaderboard/plot available to you - if you find yourself faced with a tough choice between an X% winrate with 9 soldiers or a Y% winrate with 8 soldiers, I don't anticipate the leaderboard taking a position on which of those is 'better'.

That makes sense, ty.

2abstractapplic4h
What we're facing: Relevant Weapons: Current strategies per number of soldiers: If I have to pick one strategy:

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA