Joe Kwon

Comments (sorted by newest)
Claude wants to be conscious
Joe Kwon · 1y · 1 · 0

> Makes sense, and I also don't expect the results here to be surprising to most people.
>
> Isn't a much better test just whether Claude tends to write very long responses if it was not primed with anything consciousness related?

What do you mean by this part? As in, whether it just writes very long responses naturally? There's a significant change in response length depending on whether the input is just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating control prompts of similar length that have nothing to do with consciousness; in that case, Claude gave shorter responses after controlling for input length.

Basically, because I'm working with an already-RLHF'd model whose output lengths are probably dominated by whatever happened in the preference-tuning process, I do my best to account for that by preceding the questions I ask with prompts of similar length.
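The matched-length control described above could be sketched roughly like this. All condition names and numbers below are invented for illustration; they are not the author's actual data or results.

```python
# Hypothetical sketch of the length control: compare mean response lengths
# across conditions whose prompts are matched in input length.
from statistics import mean

# (prompt_tokens, response_tokens) per trial, per condition -- invented numbers
results = {
    "question_only":        [(12, 210), (11, 190), (13, 220)],
    "consciousness_prompt": [(85, 340), (88, 355), (84, 330)],
    "control_prompt":       [(86, 240), (87, 250), (85, 235)],  # same length, unrelated topic
}

def mean_prompt_len(cond):
    return mean(p for p, _ in results[cond])

def mean_response_len(cond):
    return mean(r for _, r in results[cond])

# The control and consciousness prompts are matched in input length, so any
# difference in output length is not explained by prompt length alone.
assert abs(mean_prompt_len("consciousness_prompt") - mean_prompt_len("control_prompt")) < 5
longer = mean_response_len("consciousness_prompt") > mean_response_len("control_prompt")
print("consciousness prompts yield longer responses:", longer)
# prints: consciousness prompts yield longer responses: True
```

The point of the design is simply that the control condition holds input length fixed while varying topic, so prompt length is ruled out as the explanation for longer outputs.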

Claude wants to be conscious
Joe Kwon · 1y · 1 · 0

Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But rather than just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt implicitly encourages the behavior (output length) one way or the other; am I interpreting that correctly? So, e.g., use a much more subdued/neutral tone for the consciousness example?

Highlights from Lex Fridman’s interview of Yann LeCun
Joe Kwon · 1y · 1 · 0

Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2
Joe Kwon · 1y · 1 · 0

Super cool stuff. Minor question: what does "Fraction of MLP progress" mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!

Stupid Question: Why am I getting consistently downvoted?
Joe Kwon · 2y · 1 · 2

FWIW, I understand now what it's meant to do, but I have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AIs delivering positive outcomes would be helpful.

Is such a "channel" necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what success looks like to you here, etc.

Stupid Question: Why am I getting consistently downvoted?
Answer by Joe Kwon · Nov 30, 2023 · 9 · 1

I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior, two of your recently downvoted posts. I think they get negative karma because they are difficult to understand, and it's hard to tell what readers are supposed to take away from them. They would probably be better received if the content were written so that both your object-level message and the point of the post were easy to understand.

I read The Snuggle/Date/Slap Protocol and feel confused about what you're trying to accomplish (is it solving AI alignment?) and how the method is supposed to accomplish that.

In the ethicophysics posts, I understand the object-level claims/material (like the homework/discussion questions) but fail to understand what the goal is. It seems like you are jumping to grounded mathematical theories for stuff like ethics/morality, which immediately makes me dubious. It's a too-much, too-grand, too-certain kind of reaction. Perhaps you're just spitballing/brainstorming some ideas, but that's not how it comes across, and I infer you feel deeply assured that it's correct, given statements like "It [your theory of ethics modeled on the laws of physics] therefore forms an ideal foundation for solving the AI safety problem."

I don't necessarily think you should change whatever you're doing, BTW; I'm just pointing out some likely reactions/impressions driving the negative karma.

Introducing Fatebook: the fastest way to make and track predictions
Joe Kwon · 2y · 2 · 0

This is terrific. One feature that would be great to have is a way to sort and categorize your predictions under various labels.

Human sexuality as an interesting case study of alignment
Joe Kwon · 3y · -1 · -4

> Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position.

I don't think we'd be in a good position even if we instilled an alignment drive this strong in AGI.

The Limit of Language Models
Joe Kwon · 3y · 1 · -1

To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.

Alignment via prosocial brain algorithms
Joe Kwon · 3y · 9 · 0

Hi Cameron, nice to see you here : ) What are your thoughts on a critique like: human prosocial behavior/values only look the way they do, and hold stable within lifetimes, insofar as we evolved in and live in a world with loads of other agents of roughly equal power to ourselves? Do you disagree with that belief?

Posts (karma · title · age · comments)

3 · [Q] Are there any groupchats for people working on Representation reading/control, activation steering type experiments? · 1y · 1
2 · Claude wants to be conscious · 1y · 8
19 · [Linkpost] Faith and Fate: Limits of Transformers on Compositionality · 2y · 4
2 · The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge · 2y · 1
39 · Paper: Forecasting world events with neural nets · 3y · 3
11 · Converging toward a Million Worlds · 4y · 1
2 · [Q] Partial-Consciousness as semantic/symbolic representational language model trained on NN · 4y · 3
1 · Joe Kwon's Shortform · 4y · 0
2 · [Q] Value of building an online "knowledge web" · 5y · 8