This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialog, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

Fabien Roger
"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002. When using length-10 lists (it crushes length-5 no matter the prompt), I get:

* 32-shot, no fancy prompt: ~25%
* 0-shot, fancy python prompt: ~60%
* 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
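For readers who want to poke at this themselves, here is a minimal sketch of how such a comparison could be set up; the exact prompt wordings, list length, and scoring are my assumptions, not the code linked above:

```python
import random

def make_list(n=10, lo=0, hi=99):
    return [random.randint(lo, hi) for _ in range(n)]

def plain_prompt(xs, k=0):
    # k worked examples (k=0 or k=32 above) followed by the query
    parts = []
    for _ in range(k):
        ex = make_list(len(xs))
        parts.append(f"Input: {ex}\nSorted: {sorted(ex)}")
    parts.append(f"Input: {xs}\nSorted:")
    return "\n\n".join(parts)

def fancy_python_prompt(xs):
    # "fancy" prompt: frames the task as a Python REPL transcript, which adds
    # no information a human would need in order to sort a list
    return f">>> sorted({xs})\n"

def accuracy(completions, lists):
    # exact-match on the stringified sorted list
    return sum(c.strip().startswith(str(sorted(x)))
               for c, x in zip(completions, lists)) / len(lists)
```

The completions themselves would come from querying davinci-002 with each prompt variant, which is omitted here.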
The cost of goods has the same units as the cost of shipping: $/kg. Referencing between them lets you understand how the economy works, e.g. why construction material sourcing and drink bottling have to be local, but oil tankers exist.

* An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
* Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
* Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
* Rice and crude oil are ~$0.60/kg, about the same as the $0.72 it costs to ship them 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
* Coal and iron ore are $0.10/kg, significantly more than the cost of shipping them around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient. [6]
* Water is very cheap, with tap water at $0.002/kg in NYC. [5] But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost.

It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of coal around the world 100-400 times.

[1] iPhone is $4600/kg, large launches sell for $3500/kg, and rideshares for small satellites $6000/kg. Geostationary orbit is more expensive, so it's okay for them to cost more than an iPhone per kg, but Starlink wants to be cheaper.
[2] https://fred.stlouisfed.org/series/APU0000711415. Can't find numbers, but Antarctica flights cost $1.05/kg in 1996.
[3] https://www.bts.gov/content/average-freight-revenue-ton-mile
[4] https://markets.businessinsider.com/commodities
[5] https://www.statista.com/statistics/1232861/tap-water-prices-in-selected-us-cities/
[6] https://www.researchgate.net/figure/Total-unit-shipping-costs-for-dry-bulk-carrier-ships-per-tkm-EUR-tkm-in-2019_tbl3_351748799
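A quick back-of-the-envelope check of that last claim, using the numbers above; the per-kg cost of a full round-the-world bulk shipment is my assumption (chosen to sit below coal's $0.10/kg, per the bullet above), not a figure from the sources:

```python
strawberry_price = 11.0  # $/kg, off-season strawberries (from the list above)

# Assumed cost of shipping one kg of bulk cargo once around the world,
# bracketing Handysize vs. large bulk carriers (the latter roughly 4x cheaper).
for ship, cost_per_kg in [("Handysize", 0.10), ("large bulk carrier", 0.03)]:
    trips = strawberry_price / cost_per_kg
    print(f"{ship}: one strawberry-priced kg pays for ~{trips:.0f} trips around the world")
```

With those assumed shipping costs the ratio comes out to roughly 110-370, consistent with the quoted 100-400 range.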
dirk
Sometimes a vague phrasing is not an inaccurate demarcation of a more precise concept, but an accurate demarcation of an imprecise concept.
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I'd be interested in hearing people's answers to this question. Or, if you want more specific questions:

* By your values, do you think a misaligned AI creates a world that "rounds to zero", or still has substantial positive value?
* A common story for why aligned AI goes well goes something like: "If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way." To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
* To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
* Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
* Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world's values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that's only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
My current main cruxes:

1. Will AI get takeover capability? When?
2. Single ASI or many AGIs?
3. Will we solve technical alignment?
4. Value alignment, intent alignment, or CEV?
5. Defense > offense or offense > defense?
6. Is a long-term pause achievable?

If there is reasonable consensus on any one of those, I'd much appreciate knowing about it. Otherwise, I think these should be research priorities.

Popular Comments

Recent Discussion

We've been told by VCs and founders in the AI space that Human-level Artificial Intelligence (formerly AGI), followed by Superintelligence, will bring about a techno-utopia, if it doesn't kill us all first.

In order to fulfill that dream, AI must be sentient, and that requires it to have consciousness. Today, AI is neither of those things, so how do we get there from here?

Questions about AI consciousness and sentience have been discussed and debated by serious researchers, philosophers, and scientists for years, going back as far as the early sixties at RAND Corporation, when MIT Professor Hubert Dreyfus turned in his report on the work of AI pioneers Herbert Simon and Allen Newell, entitled "Alchemy and Artificial Intelligence."

Dreyfus believed that they spent too much time pursuing AGI and not...

This is a linkpost for On Duct Tape and Fence Posts.

Eliezer writes about fence post security: people think to themselves "in the current system, what's the weakest point?", and then dedicate their resources to shoring up the defenses at that point, not realizing that after the first small improvement in that area, there's likely now a new weakest point somewhere else.

 

Fence post security happens preemptively, when the designers of the system fixate on the most salient aspect(s) and don't consider the rest of the system. But this sort of fixation can also happen in retrospect, in which case it manifests a little differently but has similarly deleterious effects.

Consider a car that starts shaking whenever it's driven. It's uncomfortable, so the owner gets a pillow to put...

faul_sname
I'm really curious about what such fixes look like. In my experience, those edge cases tend to come about when there is some set of mutually incompatible desired properties of a system, and the mutual incompatibility isn't obvious. For example:

1. We want to use standard IEEE754 floating point numbers to store our data.
2. If two numbers are not equal to each other, they should not have the same string representation.
3. The sum of two numbers should have a precision no higher than the operand with the highest precision. For example, adding 0.1 + 0.2 should yield 0.3, not 0.30000000000000004.

It turns out those are mutually incompatible requirements! You could say "we should drop requirement 1 and use a fixed point or fraction datatype", but that's emphatically not a one-line change, and has its own places where you'll run into mutually incompatible requirements. Or you could add a "duct tape" solution like "use printf("%.2f", result) in the case where we actually ran into this problem, in which we know both operands have 2-decimal precision, and revisit if this bug comes up again in a different context".
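To make the incompatibility concrete, here is a small illustration sketched in Python rather than C (so the duct-tape line uses a format string instead of printf); the specific values are just the ones from the example above:

```python
from decimal import Decimal
from fractions import Fraction

a, b = 0.1, 0.2

# Requirement 1: standard IEEE754 doubles
total = a + b
print(total)           # 0.30000000000000004 -- violates requirement 3
print(total == 0.3)    # False: the extra digits are real, not just a display issue

# "Duct tape": format to 2 decimals in the one code path where we know both
# operands carry 2-decimal precision (analogous to printf("%.2f", result))
print(f"{total:.2f}")  # 0.30

# Dropping requirement 1 instead: exact types, but not a one-line change
print(Fraction(1, 10) + Fraction(2, 10))   # 3/10
print(Decimal("0.1") + Decimal("0.2"))     # 0.3
```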

The sum of two numbers should have a precision no higher than the operand with the highest precision. For example, adding 0.1 + 0.2 should yield 0.3, not 0.30000000000000004.

I would argue that the precision should be capped at the lowest precision of the operands. In physics, if you add two lengths, 0.123m + 0.123456m should be rounded to 0.246m.

Also, IEEE754 fundamentally does not contain information about the precision of a number. If you want to track that information correctly, you can use two floating point numbers and do interval arithmetic. There is ev... (read more)
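For what it's worth, a minimal sketch of the interval-arithmetic idea mentioned above (tracking each quantity as a [lo, hi] pair of floats); proper implementations also control the rounding direction, which is omitted here:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # A rigorous implementation would round lo down and hi up
        # (directed rounding); here we just propagate the bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# 0.123 m known to the nearest millimetre, 0.123456 m to the nearest micrometre
a = Interval(0.1225, 0.1235)
b = Interval(0.1234555, 0.1234565)
print(a + b)  # width ~0.001: the sum is only known to about a millimetre
```

The width of the result makes the physics convention explicit: the sum can't be more precise than its least precise operand.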

keltan
Wow, I kinda already knew this. But it had never been said so clearly and brought to the front of my mind in this way. It perfectly describes the strategies YouTube has used through its various apocalypses.

I just finished reading The Mom Test for the second time. I took "raw" notes here. In this post I'll first write up a bullet-point summary and then ramble off some thoughts that I have.

Summary

Introduction:

  • Trying to learn from customer conversations is like trying to excavate a delicate archeological site. The truth is down there somewhere, but it's fragile. When you dig you get closer to the truth, but you also risk damaging or smashing it.
  • Bad customer conversations are worse than useless because they mislead you, convincing you that you're on the right path when instead you're on the wrong path.
  • People talk to customers all the time, but they still end up building the wrong things. How is this possible? Almost no one talks to customers correctly.
  • Why another
...

Hm, maybe.

Sometimes it can be a win-win situation. For example, if the call leads to you identifying a problem they're having and solving it in a mutually beneficial way.

But oftentimes that isn't the case. From their perspective, the chances are low enough that, yeah, maybe the cold call just feels spammy and annoying.

I think that cold calls can be worthwhile from behind a veil of ignorance though. That's the barometer I like to use. If I were behind a veil of ignorance, would I endorse the cold call? Some cold calls are well targeted and genuine, in which case I would endorse them from behind a veil of ignorance. Others are spammy and thoughtless, in which case I wouldn't endorse them.

Concerns over AI safety and calls for government control over the technology are highly correlated, but they should not be.

There are two major forms of AI risk: misuse and misalignment. Misuse risks come from humans using AIs as tools in dangerous ways. Misalignment risks arise if AIs take their own actions at the expense of human interests.

Governments are poor stewards for both types of risk. Misuse regulation is like the regulation of any other technology. There are reasonable rules that the government might set, but omission bias and incentives to protect small but well organized groups at the expense of everyone else will lead to lots of costly ones too. Misalignment regulation is not in the Overton window for any government. Governments do not have strong incentives...

Matthew Barnett
Makes sense. By comparison, my own unconditional estimate of p(doom) is not much higher than 10%, and so it's hard on my view for any intervention to have a double-digit percentage point effect. The crude mortality rate before the pandemic was about 0.7%. If we use that number to estimate the direct cost of a 1-year pause, then this is the bar that we'd need to clear for a pause to be justified. I find it plausible that this bar could be met, but at the same time, I am also pretty skeptical of the mechanisms various people have given for how a pause will help with AI safety.
Daniel Kokotajlo
I agree that 0.7% is the number to beat for people who mostly focus on helping present humans and who don't take acausal or simulation argument stuff or cryonics seriously. I think that even if I was much more optimistic about AI alignment, I'd still think that number would be fairly plausibly beaten by a 1-year pause that begins right around the time of AGI.  What are the mechanisms people have given and why are you skeptical of them?

(Surely cryonics doesn't matter given a realistic action space? Usage of cryonics is extremely rare and I don't think there are plausible (cheap) mechanisms to increase uptake to >1% of the population. I agree that simulation arguments and similar considerations maybe imply that "helping current humans" is either incoherent or unimportant.)

Amalthea
Somewhat of a nitpick, but the relevant number would be p(doom | strong AGI being built) (maybe contrasted with p(utopia | strong AGI)), not overall p(doom).

(Half-baked work-in-progress. There might be a “version 2” of this post at some point, with fewer mistakes, and more neuroscience details, and nice illustrations and pedagogy etc. But it’s fun to chat and see if anyone has thoughts.)

1. Background

There’s a neuroscience problem that’s had me stumped since almost the very beginning of when I became interested in neuroscience at all (as a lens into AGI safety) back in 2019. But I think I might finally have “a foot in the door” towards a solution!

What is this problem? As described in my post Symbol Grounding and Human Social Instincts, I believe the following:

...

If step 5 is indeed grounded in the spatial attention being on other people, this should be testable! For example, people who pay less spatial attention to other people should feel less intense social emotions, because the steering system circuit gets activated less often and more weakly. And I think that is the case. At least ChatGPT has some confirming evidence, though it's not super clear and I haven't yet looked deeper into it.

Gunnar_Zarncke
The vestibular system can detect whether you look up or down. It could be that the reflex triggers when you a) look down (vestibular system) and b) have a visual parallax that indicates depth (visual system). Should be easy to test by closing one eye. Alternatively, it is the degree of accommodation of the lens. That should be testable by looking down with a lens that forces accommodation on short distances. The negative should also be testable by asking congenitally blind people about their experience with this feeling of dizziness close to a rim.
interstice
Tangentially related: some advanced meditators report that their sense that perception has a center vanishes at a certain point along the meditative path, and this is associated with a reduction in suffering.
Carl Feynman
You write:

…But I think people can be afraid of heights without past experience of falling…

I have seen it claimed that crawling-age babies are afraid of heights, in that they will not crawl from a solid floor to a glass platform over a yawning gulf. And they've never fallen into a yawning gulf. At that age, probably all the heights they've fallen from have been harmless, since the typical baby is both bouncy and close to the ground.

With my electronic harp mandolin project I've been enjoying working with analog and embedded audio hardware. A few weeks ago, after reading about Ugo Conti's whistle-controlled synth, I wrote to him; he gave me a call, and we had a really interesting conversation. My existing combination of hardware for my whistle synth [1] is bulky and expensive, which has me excited about a new project: I'd like to make an embedded version.

Yesterday I got started on the first component: getting audio into the microcontroller. I want to start with a standard dynamic mic, so I can keep using the same mic for talkbox and whistle synth, so it should take standard balanced audio on XLR as input. In a full version this would need an XLR port, but for now I...

Richard_Kennaway
You want those in parallel for them to add. The series combination (which I see in the breadboard pic, not just the text) is only 2µF, making your high-pass frequency a little over 10kHz.
jefftk

Whoops! You're right! Will do.
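For reference, a small sketch of the arithmetic behind Richard's point; the 4 µF part value and the input resistance are assumptions for illustration (chosen so the series combination matches the ~2 µF he mentions), not the actual part values in the build:

```python
import math

def c_series(c1, c2):
    return c1 * c2 / (c1 + c2)

def c_parallel(c1, c2):
    return c1 + c2

def highpass_cutoff(r_ohms, c_farads):
    # corner frequency of a first-order RC high-pass: f = 1 / (2*pi*R*C)
    return 1.0 / (2 * math.pi * r_ohms * c_farads)

c1 = c2 = 4e-6  # assumed equal coupling caps
cs, cp = c_series(c1, c2), c_parallel(c1, c2)
print(cs, cp)   # ~2e-06 vs 8e-06: in series the capacitance drops, in parallel it adds

# Whatever resistance R the following stage presents, the corner frequency
# scales as 1/C, so wiring the caps in parallel lowers it by a factor of 4.
r = 8.0  # assumed R, illustration only
print(highpass_cutoff(r, cs))  # ~10 kHz with the ~2 uF series combination
print(highpass_cutoff(r, cp))  # ~2.5 kHz with the caps in parallel
```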

Johannes C. Mayer
Research Writing Workflow: First figure stuff out.

* Do research and first figure stuff out, until you feel like you are not confused anymore.
* Explain it to a person, or a camera, or ideally to a person and a camera.
* If there are any hiccups, expand your understanding.
* Ideally, as the last step, explain it to somebody whom you have never explained it to.
* Only once you've made a presentation without hiccups are you ready to write the post.
* If you have a recording, this is useful as a starting point.

I like the rough thoughts way though. I'm not here to like read a textbook.

Carl Feynman
I would highly recommend getting someone else to debug your subconscious for you. At least it worked for me. I don’t think it would be possible for me to have debugged myself.

My first therapist was highly directive. He’d say stuff like “Try noticing when you think X, and asking yourself what happened immediately before that. Report back next week.” And listing agenda items and drawing diagrams on a whiteboard. As an engineer, I loved it. My second therapist was more in the “providing supportive comments while I talk about my life” school. I don’t think that helped much, at least subjectively from the inside.

Here’s a possibly instructive anecdote about my first therapist. Near the end of a session, I feel like my mind has been stretched in some heretofore-unknown direction. It’s a sensation I’ve never had before. So I say, “Wow, my mind feels like it’s been stretched in some heretofore-unknown direction. How do you do that?” He says, “Do you want me to explain?” And I say, “Does it still work if I know what you’re doing?” And he says, “Possibly not, but it’s important you feel I’m trustworthy, so I’ll explain if you want.” So I say “Why mess with success? Keep doing the thing. I trust you.” That’s an example of a debugging procedure you can’t do to yourself.

Epistemic Status: Musing and speculation, but I think there's a real thing here.

I.

When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground. 

Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder....

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! 

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being either comparably or more interpretable (the confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!
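For readers skimming, here is a rough numpy sketch of the gated encoder's forward pass as I understand it from the paper (the shapes, variable names, and weight-sharing detail are my paraphrase, not the authors' code):

```python
import numpy as np

def gated_sae_forward(x, W_gate, b_gate, r_mag, b_mag, W_dec, b_dec):
    """x: (d_model,) activation; W_gate: (d_sae, d_model); W_dec: (d_model, d_sae)."""
    x_cent = x - b_dec                             # center by the decoder bias
    pi_gate = W_gate @ x_cent + b_gate             # gate path: which features fire
    W_mag = np.exp(r_mag)[:, None] * W_gate        # magnitude weights share the gate's directions
    mag = np.maximum(W_mag @ x_cent + b_mag, 0.0)  # magnitude path: how strongly they fire
    f = (pi_gate > 0).astype(x.dtype) * mag        # gated feature activations
    x_hat = W_dec @ f + b_dec                      # reconstruction
    return f, x_hat

# Tiny smoke test with random weights
d_model, d_sae = 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=d_model)
W_gate = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
f, x_hat = gated_sae_forward(x, W_gate, np.zeros(d_sae), np.zeros(d_sae),
                             np.zeros(d_sae), W_dec, np.zeros(d_model))
print(f.shape, x_hat.shape, (f > 0).sum())
```

The key design choice is separating the binary decision of which features are active (a Heaviside step on the gate pre-activations) from the estimate of how strongly they fire, with the two paths tied up to a per-feature rescaling.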

We use learning rate 0.0003 for all Gated SAE experiments, and also for the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value.

For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by , motivated by wanting better hyperparameter transfer, and so changed learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075. In the rerelease we'll state all the values used). We didn't see n... (read more)

Senthooran Rajamanoharan
UPDATE: we've corrected equations 9 and 10 in the paper (screenshot of the draft below) and also added a footnote that hopefully helps clarify the derivation. I've also attached a revised figure 6, showing that this doesn't change the overall story (for the mathematical reasons I mentioned in my previous comment). These will go up on arXiv, along with some other minor changes (like remembering to mention the SAEs' widths), likely at some point next week. Thanks again Sam for pointing this out!

Updated equations (draft):
Updated figure 6 (shrinkage comparison for GELU-1L):
Dan Braun
This is neat, nice work! I'm finding it quite hard to get a sense of what the actual Loss Recovered numbers you report are, and to compare them concretely to other work. If possible, it'd be very helpful if you shared:

1. What the zero-ablation CE scores are for each model and SAE position. (I assume it's much worse for the MLP and attention outputs than the residual stream?)
2. What the baseline CE scores are for each model.
