An undergrad at the University of Maryland, College Park. Majoring in math.

After finishing The Sequences at the end of 9th grade, I started following the EA community, changing my career plans to AI alignment. If anyone would like to work with me on this, PM me!

I’m currently starting the EA group for the University of Maryland, College Park.

Also see my EA Forum profile

Crystalizing an agent's objective: how inner-misalignment could work in our favor

Doesn’t this post assume we have the transparency capabilities to verify the AI has human-value-preserving goals, which the AI can use? The strategy seems relevant if these tools verifiably generalize to smarter-than-human AIs, and it’s easy to build aligned human-level AIs.

Crystalizing an agent's objective: how inner-misalignment could work in our favor

Never mind, I figured it out. Its use is to get SGD to update your model in the right direction. The above 3 uses only allow you to tell whether your model is unaligned, not necessarily how to keep it aligned. This idea seems very cool!

Crystalizing an agent's objective: how inner-misalignment could work in our favor

Interesting concept. If we have interpretability tools sufficient to check whether a model is aligned, what is gained by having the model use these tools to verify its alignment?

Other ideas for how you can use such an introspective check to keep your model aligned:

  • Use an automated, untrained system
  • Use a human
  • Use a previous version of the model

We will be around in 30 years

Saying "I disbelieve <claim>" is not an argument even when <claim> is very well defined. Saying "I disbelieve <X>, and most arguments for <Y> are of the form <X> => <Y>, so I'm not convinced of <Y>" is admittedly more of an argument than the original statement, but I'd still classify it as not-an-argument unless you provide justification for why <X> is false, especially when there's strong reason to believe <X>, and strong reason to believe <Y> even if <X> is false! I think your whole post is of this latter type of statement.

I did not find your post constructive because it made a strong & confident claim in the title, then did not provide convincing argumentation for why that claim was correct, and did not provide any useful information relevant to the claim which I did not already know. Afterwards I thought reading the post was a waste of time.

I'd like to see an actual argument which engages with the prior work in this area in a non-superficial way. If this is what you mean by writing up your thoughts in a lengthier way, then I'm glad to hear you are considering this! If you mean you'll provide the same amount of information and same arguments, but in a way which would take up more of my time to read, then I'd recommend against doing that.

We will be around in 30 years

Downvoting because of the lack of arguments, not the dissenting opinion. I also reject the framing in the beginning implying that if the post is downvoted to oblivion, then it’s because you expressed a dissenting opinion rather than because your post was actually non-constructive (though I do see it was crossed out, and so I’m trying not to factor that into my decision).

AGI Ruin: A List of Lethalities

This makes more sense. I think you should clarify that this is what you mean when talking about the null string analogy in the future, especially when talking about what thinking about hard-to-think-about topics should look like. It seems fine, and probably useful, as long as you know it's a vast overstatement, but because it's a vast overstatement, it doesn't actually provide that much actionable advice. 

Concretely, instead of talking about the null string, it would be more helpful if you talked about the amount of discussion it should take a prospective researcher to reach correct conclusions. This would range from the literal null string for the optimal agent, to vague pointing in the correct direction for a pretty good researcher, to a fully formal and certain proof listing every claim and counter-claim imaginable for someone who probably shouldn't go into alignment.

AGI Ruin: A List of Lethalities

I did read the linked tweet, and now that you bring it up, my third sentence doesn't apply. But I think my first & second sentences do still apply (ignoring Eliezer's recent clarification).

AGI Ruin: A List of Lethalities

This is not at all analogous to the point I'm making. I'm saying Eliezer likely did not arrive at his conclusions in complete isolation from the outside world. This should not change the credence you put on his conclusions except to the extent you were updating on the fact it's Eliezer saying it, and the fact that he made this false claim means that you should update less on other things Eliezer claims.

AGI Ruin: A List of Lethalities

[small nitpick] 

I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them.  This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others.  It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

I find this hard to believe. I'm sure you had some conversations with others which allowed you to arrive at these conclusions. In particular, your Intelligence Explosion Microeconomics paper uses the data from the evolution of humans to make the case that making intelligence higher was easy for evolution once the ball got rolling, which is not the null string.

What DALL-E 2 can and cannot do

Perhaps there is no operation of negation on cats in its model. I'd predict it'd have an easier time just taking things out of pictures, so the prompt "a picture of my bed with no sheets" should produce a bed with no sheets. If you instead wrote "This picture has no cats in it. The title is 'the opposite of a cat'", I am uncertain what the output would be.
