maxnadeau

Comments
An alignment safety case sketch based on debate
maxnadeau · 4mo

For more discussion of the hard cases of exploration hacking, readers should see the comments on this post.

The 4-Minute Mile Effect
maxnadeau · 5mo

Typo: should be "Gell-Mann"

Tormenting Gemini 2.5 with the [[[]]][][[]] Puzzle
maxnadeau · 5mo

I figured out the encoding, but I expressed the algorithm for computing the decoding in different language than you did. My algorithm produces equivalent outputs but is substantially uglier. I wanted to leave a note here in case anyone else had the same solution.

 

Alt phrasing of the solution:

Each expression (i.e. a well-formed string of brackets) has a "degree", which is defined as the number of well-formed chunks that the encoding can be broken up into. Some examples: [], [[]], and [-[][]] have degree one; [][], -[][[]], and [][[][]] have degree two; etc.

Here's a special case: the empty string maps to 0, i.e. decode("") = 0

When an encoding has degree one, you take off the outer brackets and do 2^decode(the enclosed expression), defined recursively. So decode([]) = 2^decode("") = 2^0 = 1, decode([[]]) = 2^decode([]) = 2, etc.

Negation works as normal. So decode([-[]]) = 2^decode(-[]) = 2^(-decode([])) = 2^(-1) = 1/2

So now all we have to deal with is expressions with degree >1.

When an expression has degree >1, you compute its decoding as the product of the decoding of the first subexpression and inc(decode(everything after the first subexpression)). I will define the "inc" function shortly.

So decode([][[]]) = decode([]) * inc(decode([[]])) = 1 * inc(2)

decode([[[]]][][[]]) = decode([[[]]]) * inc(decode([][[]])) = 4 * inc(decode([]) * inc(decode([[]]))) = 4 * inc(1 * inc(2))

What is inc()? inc() is a function that computes the prime factorization of a number and then increments all the prime bases (from one prime to the next). So inc(10) = inc(2 * 5) = 3 * 7 = 21, and inc(36) = inc(2^2 * 3^2) = 3^2 * 5^2 = 225. But inc() doesn't just take in integers; it can take in any number representable as a product of primes raised to powers. So inc(2^(1/2) * 3^(-1)) = 3^(1/2) * 5^(-1) = sqrt(3)/5. I asked the language models whether there's a standard name for the set of numbers definable in this way, and they didn't have ideas.
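
In case a concrete version is useful for checking: here's a minimal Python sketch of the procedure above. All the names (decode, inc, factor, next_prime, split_chunks) are just mine, the handling of '-' is my guess at "negation works as normal" (a leading '-' negates whatever follows it), and it only covers the case where every intermediate value stays rational, so it reproduces the worked examples above but not the sqrt(3)/5-style values at the end.

```python
from fractions import Fraction

def next_prime(p):
    """Smallest prime strictly greater than p (trial division; fine for small inputs)."""
    candidate = p + 1
    while True:
        if candidate > 1 and all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            return candidate
        candidate += 1

def factor(n):
    """Prime factorization of a positive integer, as a dict {prime: exponent}."""
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def inc(x):
    """Keep the exponents but bump every prime base to the next prime:
    inc(2 * 5) = 3 * 7, inc(2**2 * 3**2) = 3**2 * 5**2.
    Only positive rationals are handled here (negative exponents come from the denominator)."""
    x = Fraction(x)
    assert x > 0, "inc() is only defined for positive values in this sketch"
    exponents = factor(x.numerator)
    for p, e in factor(x.denominator).items():
        exponents[p] = exponents.get(p, 0) - e
    result = Fraction(1)
    for p, e in exponents.items():
        result *= Fraction(next_prime(p)) ** e
    return result

def split_chunks(s):
    """Split an expression into its top-level well-formed chunks (the 'degree' is their count)."""
    chunks, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
            if depth == 0:
                chunks.append(s[start:i + 1])
                start = i + 1
    return chunks

def decode(s):
    """Decode a bracket expression, assuming every intermediate value stays rational
    (i.e. the exponent in the 2**(...) step is always an integer)."""
    if s == "":
        return Fraction(0)                      # decode("") = 0
    if s[0] == "-":                             # negation: a leading '-' negates the rest
        return -decode(s[1:])
    chunks = split_chunks(s)
    if len(chunks) == 1:                        # degree one: strip outer brackets, exponentiate
        inner = decode(s[1:-1])
        assert inner.denominator == 1, "non-integer exponent; result would be irrational"
        return Fraction(2) ** inner
    first = chunks[0]                           # degree > 1: first chunk times inc() of the rest
    return decode(first) * inc(decode(s[len(first):]))

# Worked examples from above:
assert inc(10) == 21 and inc(36) == 225
assert decode("[]") == 1
assert decode("[[]]") == 2
assert decode("[-[]]") == Fraction(1, 2)
assert decode("[][[]]") == 3                    # 1 * inc(2)
assert decode("[[[]]][][[]]") == 20             # 4 * inc(1 * inc(2)) = 4 * inc(3)
```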

Six Thoughts on AI Safety
maxnadeau · 6mo

On point 6, "Humanity can survive an unaligned superintelligence": In this section, I initially took you to be making a somewhat narrow point about humanity's safety if we develop aligned superintelligence and humanity + the aligned superintelligence has enough resources to out-innovate and out-prepare a misaligned superintelligence. But I can't tell if you think this conditional will be true, i.e. whether you think the existential risk to humanity from AI is low due to this argument. I infer from this tweet of yours that AI "kill[ing] us all" is not among your biggest fears about AI, which suggests to me that you expect the conditional to be true—am I interpreting you correctly?

We should start looking for scheming "in the wild"
maxnadeau · 6mo

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right? 

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

Open problems in emergent misalignment
maxnadeau · 6mo

People interested in working on these sorts of problems should consider applying to Open Phil's request for proposals: https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/

Detecting Strategic Deception Using Linear Probes
maxnadeau · 7mo

This section of our RFP has some other related work you might want to include, e.g. Orgad et al.

Six Thoughts on AI Safety
maxnadeau · 7mo

I think the link in footnote two goes to the wrong place?

Jesse Hoogland's Shortform
maxnadeau · 7mo

I haven't read the paper, but based only on the phrase you quote, I assume it's referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0

AI Timelines
maxnadeau · 8mo

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

Posts

Research directions Open Phil wants to fund in technical AI safety (117 karma, Ω, 7mo, 21 comments)
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas (111 karma, Ω, 7mo, 0 comments)
Update on Harvard AI Safety Team and MIT AI Alignment (60 karma, 3y, 4 comments)
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley (135 karma, Ω, 3y, 14 comments)