shawnghu

Comments
The ‘strong’ feature hypothesis could be wrong
shawnghu · 12d · 10

Oh yeah, I certainly agree with the central intent of the post; I'm now just clarifying the above discussion.

One clarification: as stated, "mechanisms operating in terms of linearly represented atoms" doesn't constrain the mechanisms themselves to be linear, does it? SAE latents are themselves a nonlinear function of the actual model activations. But if the mechanisms are substantially nonlinear, we're not really claiming much.
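For concreteness, here's a minimal sketch (numpy, with hypothetical shapes and weights) of a standard ReLU SAE, just to show where the nonlinearity sits: the latents are a nonlinear function of the activations, even though the decoder reads them back out linearly.

```python
import numpy as np

# Minimal sketch of a standard ReLU SAE; all shapes and weights here are
# hypothetical, chosen only to illustrate where the nonlinearity sits.
rng = np.random.default_rng(0)
d_model, d_sae = 512, 4096
W_enc = rng.standard_normal((d_sae, d_model)) * 0.02
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.02

x = rng.standard_normal(d_model)        # a model activation vector
z = np.maximum(W_enc @ x + b_enc, 0.0)  # SAE latents: a nonlinear function of x
x_hat = W_dec @ z                       # reconstruction: linear in the latents z
```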

My own impression is that things are nonlinear unless proven otherwise, and a priori I would strongly expect the strong linear representation hypothesis to simply be false. In general it seems extremely wishful to hope that exactly the things that are nonlinear (in whatever sense we mean) are unimportant, especially since we employ neural networks specifically to learn really weird functions we couldn't have thought of ourselves.

Humanity Learned Almost Nothing From COVID-19
shawnghu · 13d · 20

Do you have any suggestions, or references to resources, for what individuals should do to be better prepared for another global pandemic?

Humanity Learned Almost Nothing From COVID-19
shawnghu · 13d · 30

"Exceedingly virtuous exceptions exist, I'll praise the ones I know of at the end."


Where?

The ‘strong’ feature hypothesis could be wrong
shawnghu · 13d · 10

I see. If I understand you correctly, a mechanism, whether human-interpretable or not, which seems to be functionally separate but is not explainable in terms of its operations on linear subspaces of activation space would count as evidence against the strong feature hypothesis, right?

Aren't the MLPs in a transformer straightforward examples of this?
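For concreteness, a minimal sketch (numpy, hypothetical shapes) of a GPT-2-style MLP block: affine map in, elementwise nonlinearity, affine map out, i.e. a nonlinear function of the residual-stream activation rather than an operation on a fixed linear subspace.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2-style MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Hypothetical shapes; the point is just affine -> nonlinearity -> affine.
rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W_in = rng.standard_normal((d_ff, d_model)) * 0.02
W_out = rng.standard_normal((d_model, d_ff)) * 0.02

x = rng.standard_normal(d_model)  # a residual-stream activation
y = W_out @ gelu(W_in @ x)        # the MLP's (nonlinear) contribution to the stream
```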

(BTW, I agree with the main thrust of the post. I think the linear feature hypothesis, in its most usefully strong forms, should be treated as false by default unless proven otherwise; I appreciate the point you made two comments up about how "disproving a vague hypothesis is a bit difficult".)

shawnghu's Shortform
shawnghu · 14d · 20

I didn't do a lot of thorough research, but maybe I simply don't know how to.

I googled around for resources, which usually leads to... I don't know how to describe this, but short-form articles that are not very information-dense and contradict one another. I also looked for opinions on Reddit and for an FAQ-like thing on /r/Ergonomics, which likewise didn't tell me much definitive, except that a) people have a variety of problems owing to their variety of body shapes, and b) it is normal to want a desk that's significantly lower than most desks.

I must have done some amount of Claude-querying, but it's labor-intensive to figure out what the root problems are here and whether there are canonical solutions to them, possibly because the resources Claude would most easily reference are the same inadequate ones I've just described. I bet it's possible to figure this out with Claude if I go slowly and specifically enough, though.

I don't think I found anything even approaching a central resource which claims to be comprehensive (however opinionated). Something like what they have at /r/bodyweightfitness, for example, would be excellent by the standards described here.

The ‘strong’ feature hypothesis could be wrong
shawnghu · 14d · 10

It feels to me like evaluating any of the sentences in your comment rigorously requires a more specific definition of "feature".

The ‘strong’ feature hypothesis could be wrong
shawnghu · 15d · 10

While these are logically distinct things, can you think of an experiment that would be able to distinguish the two, even in principle? In other words, you say "our existing techniques can't yet", but what would a technique that can distinguish them even look like?

My Empathy Is Rarely Kind
shawnghu · 19d · 30

I don't necessarily disagree with this way of looking at things.

Serious question-- how do you calibrate the standard by which you judge that something is good enough to warrant respect, or self-respect?

To illustrate what I mean, in one of your examples you judge your project teammates negatively for not having had the broad awareness to seriously learn ML ahead of time, in the absence of other obvious external stimuli to do so (like classes that would be hard enough to actually require that). The root of the negative judgment is that a better objective didn't occur to them.

How can you ever be sure that there isn't a better objective that isn't occurring to you, at any given time? More broadly, how can you be sure that there isn't just a generally better way of living, that you're messing up for not currently doing?

If, hypothetically, you encountered a better version of yourself who presented you with a better objective and better ways of living, would you retroactively judge your life up to the present moment as worse and less worthy of respect? (Perhaps, based on your answer to the previous question, the answer is "yes", but you consider this an unlikely scenario.)

shawnghu's Shortform
shawnghu · 21d · 20

Does anyone have a rigorous reference or primer on computer ergonomics, or ergonomics in general? It's hard to find a reference that lays out, with authority or solid reasoning, what good ergonomics are and why, and what the solutions to common problems are.

How to Feel More Alive
shawnghu · 1mo · 20

Hey, I think I relate to this! I didn't expect to see this phenomenon described anywhere, and I'm happy that you took the time to describe it.

I think I was able to improve on this (or, the aspects of this that were an issue for me) by coming up with new ways of expressing what I think and feel in lossier/less-precise/more vibe-driven ways.
