This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialogue, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

Eric Neyman (19h)
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

* By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
* A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
* To what extent is your belief that aligned AI would go well contingent on an assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
* Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
* Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling have to be local, but oil tankers exist.

* An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
* Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
* Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
* Rice and crude oil are ~$0.60/kg, about the same as the $0.72 it costs to ship a kilogram 5,000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
* Coal and iron ore are $0.10/kg, significantly more than the cost of shipping them around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient. [6]
* Water is very cheap, with tap water at $0.002/kg in NYC. [5] But shipping via tanker is also very cheap, so you can ship it maybe 1,000km before equaling its cost.

[1] iPhone is $4,600/kg; large launches sell for $3,500/kg, and rideshares for small satellites $6,000/kg. Geostationary orbit is more expensive, so it's okay for those satellites to cost more than an iPhone per kg, but Starlink has to be cheaper.
[2] https://fred.stlouisfed.org/series/APU0000711415. Can't find current numbers, but Antarctica flights cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
[6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799
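For concreteness, here is a minimal Python sketch of the comparison the comment is making, using only the rough figures quoted above (the prices and the truck-freight rate are copied from the comment and its footnotes, not re-derived):

```python
# Rough $/kg comparison of goods prices vs. shipping costs.
# All figures are approximate and copied from the comment above.

goods_usd_per_kg = {
    "iPhone": 4600,
    "beef / copper / off-season strawberries": 11,
    "oranges / aluminum": 3,
    "rice / crude oil": 0.60,
    "coal / iron ore": 0.10,
    "tap water (NYC)": 0.002,
}

# Example shipping costs, also in $/kg:
leo_launch = 3500              # large launch to low Earth orbit
uber_250km = 3 * 250 / 75      # $3/km * 250 km, split over a 75 kg passenger ~= $10/kg
truck_5000km = 0.72            # trucking 1 kg roughly 5,000 km across the US

for good, price in goods_usd_per_kg.items():
    ratio = price / truck_5000km
    print(f"{good:<40} ${price:>8.3f}/kg  ~{ratio:,.2f}x the cost of trucking it 5,000 km")
```

The point of matching units is that any good whose price sits far below a given shipping cost can't economically travel that way, which is why bulk commodities move by ship while iPhones can fly.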
Roman Mazurenko is dead again. The first resurrected person, Roman lived on as a chatbot (2016-2024) created from his conversations with his fiancée. You might even have been able to download him as an app. But not any more. His fiancée married again and her startup http://Replika.ai pivoted from resurrection help to AI girlfriends and psychological consulting. It looks like they quietly removed the Roman Mazurenko app from public access. It is a particular pity that his digital twin lived for less time than his biological original, who died at 32, especially now that we have much more powerful instruments for creating semi-uploads based on LLMs with large context windows.


Recent Discussion

quila (31m)
i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop. it's already working.
quila (7m)

when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.

they agreed to watch this as part of a negotiation. they watched part of it.

they changed their behavior slightly -- as a negotiation -- for about a month.

they didn't care.

they looked horror in the eye. they didn't flinch. they saw themself in it.

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! 

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being comparably or more interpretable (the confidence interval for the increase in interpretability is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!
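To make the architecture concrete, here is a minimal PyTorch sketch of a Gated SAE forward pass as I understand it from the paper: the gate path decides *which* features fire via a Heaviside step, the magnitude path decides *how much* they fire, and the two encoder paths share weights up to a per-feature rescaling. This is a sketch only; the paper's training loss (which includes an L1 penalty on the gate pre-activations and an auxiliary reconstruction term with a frozen decoder) and its initialization details are omitted.

```python
import torch
import torch.nn as nn


class GatedSAE(nn.Module):
    """Sketch of a Gated SAE encoder/decoder (not the paper's full training setup)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)  # shared encoder weights
        self.r_mag = nn.Parameter(torch.zeros(d_sae))                  # per-feature rescaling (log-scale)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_enc + self.b_gate                     # decides which features are active
        pi_mag = x_cent @ (self.W_enc * self.r_mag.exp()) + self.b_mag  # decides how strongly they fire
        f = (pi_gate > 0).float() * torch.relu(pi_mag)                  # gated feature activations
        recon = f @ self.W_dec + self.b_dec                             # reconstruction of x
        return f, recon
```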

fvncc (19m)

Hi, any idea how this would compare to just replacing the loss with a smoothed loss function? Something like (summed across the sparse representation).

Neel Nanda (1h)
I haven't fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG because it only takes derivatives. Eg the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.
Sam Marks (29m)
Yep, you're totally right -- thanks!
jacobcd52 (1h)
Nice work! I'm not sure I fully understand what the "gated-ness" is adding, i.e. what role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:

Let $f$ and $\hat{x}$ be the encoder and decoder functions, as in your paper, and let $x$ be the model activation that is fed into the SAE. The usual SAE reconstruction is $\hat{x}(f(x))$, which suffers from the shrinkage problem.

Now, introduce a new learned parameter $t \in \mathbb{R}^{n_\text{features}}$, and define an "expanded" reconstruction $y_\text{expanded} = \hat{x}(t \odot f(x))$, where $\odot$ denotes elementwise multiplication.

Finally, take the loss to be:
$$L = \|\hat{x}_\text{copy}(f(x)) - x\|_2^2 + \|y_\text{expanded} - x\|_2^2 + \lambda \|f(x)\|_1,$$
where $\hat{x}_\text{copy}$ ensures the decoder gets no gradients from the first term.

As I understand it, this is exactly the loss appearing in your paper. The only difference in the setup is the lack of the Heaviside step function. Did you try this setup? Or does it fail for an obvious reason I missed?
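For concreteness, a minimal sketch of the loss being proposed here, assuming a linear decoder $\hat{x}(h) = h W_\text{dec} + b_\text{dec}$ so that "the decoder gets no gradients from the first term" can be written with a detach (the function and argument names are mine, for illustration only):

```python
import torch

def proposed_loss(x, f_x, W_dec, b_dec, t, lam):
    # x:    model activations,         shape [batch, d_model]
    # f_x:  encoder output f(x),       shape [batch, n_features]
    # t:    learned per-feature scale, shape [n_features]
    # lam:  L1 sparsity coefficient

    # First term: plain reconstruction x_hat_copy(f(x)); the decoder parameters
    # are detached, so only the encoder receives gradients from this term.
    recon_copy = f_x @ W_dec.detach() + b_dec.detach()
    term1 = ((recon_copy - x) ** 2).sum(dim=-1).mean()

    # Second term: "expanded" reconstruction x_hat(t * f(x)); the decoder
    # (and t) train on this one.
    recon_expanded = (t * f_x) @ W_dec + b_dec
    term2 = ((recon_expanded - x) ** 2).sum(dim=-1).mean()

    # Sparsity penalty on the raw feature activations.
    l1 = f_x.abs().sum(dim=-1).mean()

    return term1 + term2 + lam * l1
```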
Rafael Harth (2h)
Are people in rich countries happier on average than people in poor countries? (According to GPT-4, the academic consensus is that they are, but I'm not sure it's representing it correctly.) If so, why do suicide rates increase (or is that a false positive)? Does the mean of the distribution go up while the tails don't, or something?

People in rich countries are generally happier than people in poor countries (both in terms of the share who say they are "happy" or "very happy" and in self-reported life satisfaction); see many of the graphs at https://ourworldindata.org/happiness-and-life-satisfaction

In general it seems like richer countries also have lower suicide rates: "for every 1000 US dollar increase in the GDP per capita, suicide rates are reduced by 2%".

Warning: This post might be depressing to read for everyone except trans women. Gender identity and suicide is discussed. This is all highly speculative. I know near-zero about biology, chemistry, or physiology. I do not recommend anyone take hormones to try to increase their intelligence; mood & identity are more important.

Why are trans women so intellectually successful? They seem to be overrepresented 5-100x in eg cybersecurity twitter, mathy AI alignment, non-scam crypto twitter, math PhD programs, etc.

To explain this, let's first ask: Why aren't males way smarter than females on average? Males have ~13% higher cortical neuron density and 11% heavier brains (implying $1.11^{2/3} - 1 \approx 7\%$ more area?). One might expect males to have mean IQ far above females then, but instead the means and medians are similar.


My theory...

metachirality (6h)
Copied from a reply on lukehmiles' short form: If it is related to IQ, however, this is less plausible, although perhaps some sort of selection effect is happening here.
interstice (11h)
The post is about the performance gap of trans women over men, not women.
kromem (29m)

It implicitly does compare trans women to other women in talking about the performance similarity between men and women:

"Why aren't males way smarter than females on average? Males have ~13% higher cortical neuron density and 11% heavier brains (implying 1.112/3−1=7% more area?). One might expect males to have mean IQ far above females then, but instead the means and medians are similar"

So OP is saying "look, women and men are the same, but trans women are exceptional."

I'm saying that identifying the exceptionality of trans women ignores the environmental... (read more)

quetzal_rainbow (12h)
Whoops, it really looks like I imagined this claim to be backed by more than one SSC post. In my defense, the poll did cover a real phenomenon, namely abnormal illusion processing in schizophrenics (see "Systematic review of visual illusions in schizophrenia", Costa et al., 2023), and I think it's overall plausible. My general objection stays the same: there are a bazillion sources on brain differences in transgender individuals, transgenderism is likely to be a brain anomaly, and we don't need to invoke the "testosterone damage" hypothesis.

Crosspost from my blog.  

If you spend a lot of time in the blogosphere, you'll find a great many people expressing contrarian views. If you hang out in the circles that I do, you'll probably have heard Yudkowsky say that dieting doesn't really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn't improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn't work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don't really know what to think about them.

For...

Hmm, this sounds like an awfully contrarian take to me.

Matthew Barnett (1h)
I'm broadly sympathetic to this post. I think a lot of people adjacent to the LessWrong cluster tend to believe contrarian claims on the basis of flimsy evidence. That said, I am fairly confident that Scott Alexander misrepresented Robin Hanson's position on medicine in that post, as I pointed out in my comment here. So, I'd urge you not to update too far on this particular question, at least until Hanson has responded to the post. (However, I do think Robin Hanson has stated his views on this topic in a confusing way that reliably leads to misinterpretation.)
Jacob G-W (2h)
I think I've noticed some sort of cognitive bias in myself and others where we are naturally biased towards "contrarian" or "secret" views because it feels good to know something that others don't know / be right about something that so many people are wrong about. Does this bias have a name? GPT4 says it's the Illusion of asymmetric insight, which I'm not sure is the same thing (I think it is the more general term, whereas I'm looking for one specific to contrarian views). Interestingly, it only has one hit on lesswrong. I think more people should know about this (the specific one about contrarianism) since it seems fairly common.
Jacob G-W (2h)
Thank you for writing this! It expresses in a clear way a pattern that I've seen in myself: I eagerly jump into contrarian ideas because it feels "good" and then slowly get out of them as I start to realize they are not true.

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.


From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...

trevor (1h)

If math education had been better at the time (or today, for that matter), he probably would have had an even more general skillset and thought process.

Probably not nearly to the degree of Von Neumann, of course, but I still like to think about what he would have achieved. There were probably many things that were instrumentally convergent (e.g. a formalized concept of instrumental convergence that's universal for all mind configurations, instead of just for all human cultures, which he explored substantially).

cousin_it (5h)
Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well. That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Oscar Wilde, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is the opposite of that and charming for it! Maybe the only possible advice to an artist is to follow their talent, and advising anything more specific is as likely to hinder as help.
Viliam (3h)
The theories are probably just rationalizations anyway.

About a year ago I decided to try using one of those apps where you tie your goals to some kind of financial penalty. The specific one I tried is Forfeit, which I liked the look of because it's relatively simple: you set single tasks, which you have to verify you have completed with a photo.

I’m generally pretty sceptical of productivity systems, tools for thought, mindset shifts, life hacks and so on. But this one I have found to be really shockingly effective; it has been about the biggest positive change to my life that I can remember. I feel like the category of things which benefit from careful planning and execution over time has completely opened up to me, whereas previously things like this would be largely down to the...

quiet_NaN (7h)
In the subagent view, a financial precommitment that another subagent has arranged for the sole purpose of coercing you into one course of action is a threat. Plenty of branches of decision theory advise you to disregard threats, because consistently doing so will mean that instances of you will more rarely find themselves in a position to be threatened. Of course, one can discuss how rational these subagents are in the first place. The "stay in bed, watch Netflix and eat potato chips" subagent is probably not very concerned with high-level abstract planning, might have a bad discount function for future benefits, and is probably not all that interested in the utility it gets from being principled.
quiet_NaN (8h)
To whoever overall-downvoted this comment: I do not think that this is a troll. Being a depressed person, I can totally see this being real. Personally, I would try to start slow with positive reinforcement. If video games are the only thing which you can get yourself to do, start there. Try to do something intellectually interesting in them. Implement a four-bit adder in Dwarf Fortress using cat logic. Play KSP with the Principia mod. Write a mod for a game. Use math or Monte Carlo simulations to figure out the best way to accomplish something in a video game, even if it will take ten times longer than just taking a non-optimal route. Some of my proudest intellectual accomplishments are in projects which have zero bearing on the real world. (Of course, I am one to talk right now, spending five hours playing Rimworld in a not-terribly-clever way for every hour I work on my thesis.)

My depression is well-controlled at the moment, and I have actually found various methods to help me get things done, since I don't respond well to the simplest versions of carrot-and-stick methods. The most pleasant is finding someone else to do it with me (or at least act involved while I do the actual work).

On the other hand, there have been times when procrastinating actually gives me a thrill, like I'm getting away with something. Mediocre video games become much more appealing when I have work to avoid.

dreeves (17h)
I think this is a persuasive case that commitment devices aren't good for you. I'm very interested in how common this is, and whether there's a way you could reframe commitment devices to avoid this psychological reaction to them. One idea is to focus on incentive alignment that avoids the far end of the spectrum. With Beeminder in particular, you could set a low pledge cap and then focus on the positive reinforcement of keeping your graph pretty by keeping the datapoints on the right side of the red line.

U.S. Secretary of Commerce Gina Raimondo announced today additional members of the executive leadership team of the U.S. AI Safety Institute (AISI), which is housed at the National Institute of Standards and Technology (NIST). Raimondo named Paul Christiano as Head of AI Safety, Adam Russell as Chief Vision Officer, Mara Campbell as Acting Chief Operating Officer and Chief of Staff, Rob Reich as Senior Advisor, and Mark Latonero as Head of International Engagement. They will join AISI Director Elizabeth Kelly and Chief Technology Officer Elham Tabassi, who were announced in February. The AISI was established within NIST at the direction of President Biden, including to support the responsibilities assigned to the Department of Commerce under the President’s landmark Executive Order.

Paul Christiano, Head of AI Safety, will design

...
Davidmanheim (6h)
But what? Should we insist that the entire time someone's inside a BSL-4 lab, we have a second person who is an expert in biosafety visually monitoring them to ensure they don't make mistakes? Or should their air supply not use filters and completely safe PAPRs, and instead feed them outside air through a tube that restricts their ability to move around? Or do you have some new idea that isn't just a ban with more words?

Sure, list-based approaches are insufficient, but they have relatively little to do with the biosafety levels of labs; they have to do with risk groups, which are distinct but often conflated. (So Ebola or smallpox isn't a "BSL-4 pathogen", because there is no such thing.)

That ban didn't go far enough, since it only applied to 3 pathogen types, and it wouldn't have banned what Wuhan was doing with novel viruses, since that work wasn't with SARS or MERS but with other species of virus. So sure, we could enforce a broader version of that ban, but getting a good definition that's both extensive enough to prevent dangerous work and that doesn't ban obviously useful research is very hard.
Davidmanheim (7h)
Having written extensively about it, I promise you I'm aware. But please, tell me more about how this supports the original claim I have been disagreeing with: that this class of incidents was or is the primary concern of the EA biosecurity community, the one that led to it being a cause area.
aysja (2h)

I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high-attack-surface domains is quite bad (as with BSL). This still seems true to me.

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

The RLCT = first-order term for in-distribution generalization error
 

Clarification: The 'derivation' for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the derivation of the vanilla Bayesian Information Criterion. I don't like this derivation very much. See e.g. this one on Wikipedia.

So what it's actually showing is just that:

  1. If you've got a class of different hypotheses, containing many individual hypotheses.
  2. And you've got a prior ahead of time that s
... (read more)
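For readers following along, the parallel being gestured at in the comment above is, as I understand the standard statements from the singular learning theory literature (a hedged summary, not this comment's full derivation): for a regular $d$-parameter model, the Laplace/BIC expansion of the log marginal likelihood is
$$\log p(D_n) \approx \log p(D_n \mid \hat{w}) - \frac{d}{2}\log n,$$
while for singular models Watanabe's free energy expansion replaces $d/2$ with the RLCT $\lambda$,
$$-\log p(D_n) \approx n L_n(w_0) + \lambda \log n + O_p(\log\log n),$$
and the expected Bayes generalization error then scales as $\mathbb{E}[G_n] \approx \lambda/n$, which is the sense in which the RLCT is the first-order term for in-distribution generalization error.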
cubefox (4h)
What's more likely: you being wrong about the obviousness of the spherical Earth theory to sailors, or the entire written record (which included information from people who had extensive access to the sea) of two thousand years of Chinese history and astronomy somehow omitting the spherical Earth theory? Not to speak of other pre-Hellenistic seafaring cultures which also lack records of having discovered the spherical Earth theory.
Lucius Bushnaq (5h)
It's measuring the volume of points in parameter space with loss $< \epsilon$ when $\epsilon$ is infinitesimal. This is slightly tricky because it doesn't restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with $\epsilon$ instead.

In real networks trained with finite amounts of data, you care about the case where $\epsilon$ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss $< \epsilon$, if you can manage that.

I still think SLT has some neat insights that helped me deconfuse myself about networks. For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss $< \epsilon$ using just the eigenvalues of the Hessian. You can't. At least not in general.

[1] Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space.
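For concreteness, the volume-scaling statement being referred to is, as I understand the standard SLT result (a hedged summary, not taken from this thread): writing $\varphi$ for the prior and $w_0$ for a minimum of the population loss $L$, the prior volume of the near-minimal set scales as
$$V(\epsilon) := \int_{L(w) < L(w_0) + \epsilon} \varphi(w)\, dw \;\sim\; c\,\epsilon^{\lambda} \left(\log \tfrac{1}{\epsilon}\right)^{m-1} \quad \text{as } \epsilon \to 0,$$
where $\lambda$ is the RLCT and $m$ its multiplicity. A regular (quadratic) minimum recovers $\lambda = d/2$, which is why the Hessian-eigenvalue heuristic works there but not in general.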
tailcalled (6h)
Yes.
