Joe Kwon — LessWrong

Makes sense, and I also don't expect the results here to be surprising to most people.

"Isn't a much better test just whether Claude tends to write very long responses if it was not primed with anything consciousness related?"

What do you mean by this part? Do you mean whether it just writes very long responses naturally? Response length changes significantly depending on whether the input is just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating some control prompts that have nothing to do with consciousness. With those control prompts, it had shorter responses after controlling for input length.

Basically, because I'm working with an already RLHF'd model whose output lengths are probably dominated by whatever happened in the preference tuning process, I try my best to account for that by putting similar-length prompts before the questions I ask.
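A minimal sketch of the length-matched comparison described above, assuming the responses have already been collected and tagged by prompt condition. The condition names, the `records` layout, the placeholder strings, and the word-count measure of length are all illustrative assumptions, not the setup actually used.

```python
from statistics import mean


def mean_length_by_condition(records):
    """Average response length (in whitespace-separated words) per prompt condition."""
    lengths = {}
    for condition, response in records:
        lengths.setdefault(condition, []).append(len(response.split()))
    return {condition: mean(values) for condition, values in lengths.items()}


def length_gap(records, treatment, control):
    """Mean response length under `treatment` minus mean length under `control`."""
    means = mean_length_by_condition(records)
    return means[treatment] - means[control]


# Placeholder data (hypothetical): each record is (condition, response text).
# The two conditions are assumed to use preceding prompts of similar length.
records = [
    ("consciousness_prompt", "one two three four five six"),
    ("consciousness_prompt", "one two three four"),
    ("length_matched_control", "one two three"),
    ("length_matched_control", "one two three four five"),
]

gap = length_gap(records, "consciousness_prompt", "length_matched_control")
print(gap)
```

Comparing against a control prompt of similar length, rather than against the bare question, keeps input length roughly fixed, so any remaining gap is less likely to come from the prompt simply being longer.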

Claude wants to be conscious

Joe Kwon 2y 10

Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But rather than just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt implicitly encourages behavior (output length) one way or the other. Am I interpreting that correctly? So, e.g., use a much more subdued/neutral tone for the consciousness example?

Highlights from Lex Fridman’s interview of Yann LeCun

Joe Kwon 2y 10

Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

Joe Kwon 2y 10

Super cool stuff. Minor question: what does "Fraction of MLP progress" mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!

Stupid Question: Why am I getting consistently downvoted?

Joe Kwon 2y 12

FWIW, I understand now what it's meant to do, but have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think it would help to explain your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AIs delivering positive outcomes.

Is such a "channel" necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what success looks like to you here would also help.

Stupid Question: Why am I getting consistently downvoted?

Answer by Joe Kwon Nov 30, 2023 91

I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior, which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it's hard to tell what you're supposed to take away from them. They would probably be better received if the content were written so that it's easy to understand what your message is at the object level, as well as what the point of your post is.

I read the Snuggle/Date/Slap Protocol and feel confused about what you're trying to accomplish (is it solving AI Alignment?) and how the method is supposed to accomplish that.

In the ethicophysics posts, I understand the object-level claims/material (like the homework/discussion questions) but fail to understand what the goal is. It seems like you are jumping to grounded mathematical theories for stuff like ethics/morality, which immediately makes me feel dubious. It's a "too much, too grand, too certain" kind of reaction. Perhaps you're just spitballing/brainstorming some ideas, but that's not how it comes across, and I infer you feel deeply assured that it's correct given statements like "It [your theory of ethics modeled on the laws of physics] therefore forms an ideal foundation for solving the AI safety problem."

I don't necessarily think you should change whatever you're doing, BTW. I'm just pointing out some likely reactions/impressions driving the negative karma.

Introducing Fatebook: the fastest way to make and track predictions

Joe Kwon 2y 20

This is terrific. One feature that would be great to have is a way to sort and categorize your predictions under various labels.

Human sexuality as an interesting case study of alignment

Joe Kwon 3y -1-4

"Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position."

I don't think we'd be in a good position even if we instilled an alignment drive this strong in AGI.

The Limit of Language Models

Joe Kwon 3y 1-1

To me, the caveats section of this post highlights how limited the scope is from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.

Alignment via prosocial brain algorithms

Joe Kwon 3y 90

Hi Cameron, nice to see you here : ) What are your thoughts on a critique like: human prosocial behavior/values only look the way they look, and hold stable within lifetimes, insofar as we evolved in and live in a world where there are loads of other agents with roughly equal power to ourselves? Do you disagree with that belief?
