Summary

I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don't think I understand what you're saying that well. I also think we're probably on mostly the same page, and am not that interested in hashing out further potential disagreements.

Also, mostly unrelated, maybe a hot take, but if you're able to get outcompeted because you don't upload, then the future you're in is not very good.

Replying toOn Eating the Sun

Cool. I misinterpreted your previous comment and think we're basically on the same page.

Replying toOn Eating the Sun

I think the majority of humans probably won't want to be uploads, leave the solar system permanently, etc. Maybe this is where we disagree? I don't really think there's going to be a thing that most people care about more.

Replying toOn Eating the Sun

I don't think that's a very good analogy, but I will say that is is basically true for the Amish. And I do think that we should respect their preferences. (I seperately think cars are not that good, and that people would infact prefer to bicycle around or ride house drawn carriges or whatever if civilization was conducive to that, although that's kinda besides the point.)

I'm not arguing that we should be conservative about changing the sun. I'm just claiming that people like the sun and won't want to see it eaten/fundamentally transformed, and that we should respect this preference. This is reason why it's different from candles -> lightbulbs, because people... (read more)

Replying toOn Eating the Sun

I am claiming that people when informed will want the sun to continuing being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I seperately claim that if this is true, then other people should just respect this preference, and use the other stars that people don't care about for energy.

Replying toOn Eating the Sun

But most people on Earth don't want "an artificial system to light the Earth in such a way as to mimic the sun", they want the actual sun to go on existing.

from https://time.com/6961068/sam-bankman-fried-prison-sentence/

This is in part the reasoning used by Judge Kaplan:

Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”

It's kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one "must" spend too long doing abstract slippery stuff to really understand the nature of why it doesn't really work that well?

Estimating Tail Risk in Neural Networks

I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.

"Undignified" is really vague

I sometimes see/hear people say that "X would be a really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X".

Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.

•••

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or where ever they're renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
ANT has a stronger safety culture, and so it is a more pleasant experience to work at

... (read more)

•••

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance... (read 823 more words →)

•••

Backdoors as an analogy for deceptive alignment

Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.

Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems... (read 6753 more words →)

Jacob_Hilton

Jacob_Hilton, Mark Xu

ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.

In this post, we will:

Lay out the analogy between backdoors and deceptive alignment
Discuss prior theoretical results from the perspective of this analogy
Explain

... (read 2238 more words →)

104

If you weren't such an idiot...

kave

kave, Mark Xu

My friend Buck once told me that he often had interactions with me that felt like I was saying “If you weren’t such a fucking idiot, you would obviously do…” Here’s a list of such advice in that spirit.
Note that if you do/don’t do these things, I’m technically calling you an idiot, but I do/don’t do a bunch of them too. We can be idiots together.
If you weren’t such a fucking idiot…
You would have multiple copies of any object that would make you sad if you didn’t have it
Examples: ear plugs, melatonin, eye masks, hats, sun glasses, various foods, possibly computers, etc.
You would spend money on goods and services.
Examples of goods: faster

... (read 472 more words →)

175

•••

ARC is hiring theoretical researchers

How to do theoretical research, a personal perspective

paulfchristiano, Jacob_Hilton, Mark Xu

The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here.

Update January 2024: we have paused hiring and expect to reopen in the second half of 2024. We are open to expressions of interest but do not plan to process them until that time.

What is ARC’s Theory team?

The Alignment Research Center (ARC) is a non-profit whose mission is to align future machine learning systems with human interests. The high-level agenda of the Theory team (not to be confused with the Evals team) is described by the report on Eliciting Latent Knowledge (ELK): roughly speaking, we’re trying to design ML training objectives that... (read 1083 more words →)

126

"Where do new [algorithms] come from? I keep reading about someone who invented [an algorithm] to do something-or-other but there's no mention of how."
A shrug of robed shoulders. "Where do new books come from, Mr. Potter? Those who read many books sometimes become able to write them in turn. How? No one knows."
"There are books on how to write -"
"Reading them will not make you a famous playwright. After all such advice is accounted for, what remains is mystery. The invention of new [algorithms] is a similar mystery of purer form."
- Harry Potter and the Methods of Rationality

Most of the content in this document came out of extensive conversations with Paul Christiano.

(This... (read 4343 more words →)

ELK prize results

ELK First Round Contest Winners

paulfchristiano, Mark Xu

From January - February the Alignment Research Center offered prizes for proposed algorithms for eliciting latent knowledge. In total we received 197 proposals and are awarding 32 prizes of $5k-20k. We are also giving 24 proposals honorable mentions of $1k, for a total of $274,000.

Several submissions contained perspectives, tricks, or counterexamples that were new to us. We were quite happy to see so many people engaging with ELK, and we were surprised by the number and quality of submissions. That said, at a high level most of the submissions explored approaches that we have also considered; we underestimated how much convergence there would be amongst different proposals.

In the rest of this post we’ll present... (read 6219 more words →)

138

Mark Xu, paulfchristiano

Thank you to all those who have submitted proposals to the ELK proposal competition. We have evaluated all proposals submitted before January 14th^[1]. Decisions are still being made on proposals submitted after January 14th.

The deadline for submissions is February 15th, after which we will release summaries of the proposals and associated counterexamples.

We evaluated 30 distinct proposals from 25 people. We awarded a total of $70,000 for proposals from the following 8 people:

$20,000 for a proposal from Sam Marks
$10,000 for a proposal from Dmitrii Krasheninnikov
$5,000 for a proposal from Maria Shakhova
$10,000 for proposals from P, who asked to remain anonymous
$10,000 for a proposal from Scott Viteri
$5,000 for a proposal from Jacob Hilton and

paulfchristiano, Mark Xu, Ajeya Cotra

ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.

The core difficulty we discuss is learning how to map between an AI’s model of the world and a human’s model. This is closely related to ontology identification (and other similar statements). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.

The report is... (read more)

230

ARC is hiring!

Your Time Might Be More Valuable Than You Think

paulfchristiano, Mark Xu

The Alignment Research Center is hiring researchers; if you are interested, please apply!

(Update: We have wrapped up our hiring "round" for early 2022, but are still accepting researcher applications on a rolling basis here, though it may take us longer to get back to you.)

What is ARC?

ARC is a non-profit organization focused on theoretical research to align future machine learning systems with human interests. We are aiming to develop alignment strategies that would continue to work regardless of how far we scaled up ML or how ML models end up working internally.

Probably the best way to understand our work is to read Eliciting Latent Knowledge, a report describing some recent and upcoming... (read 287 more words →)