Logan Riggs — LessWrong

A Review of Nina Panickssery’s Review of Scott Alexander’s Review of “If Anyone Builds It, Everyone Dies”

For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)

I'm interpreting you as saying "If we solved outer alignment & had a perfect reward function, it would be good if the model itself was optimizing for that reward function (ie inner alignment)"

In which case, we are not on the same page (ie inner & outer alignment decompose a hard problem into two more difficult ones).

For the book, it's interesting they went w/ the evolution argument. I still prefer the shard theory analogy of humans being misaligned w/ the reward system (ie I intentionally avoid taking fentanyl, even though that would be very rewarding/reinforcing), which can still end up in similar sharp-left turns if the model eg takes fentanyl (or other goals).

Evolution is still not believed by everyone in the US (though oddly ranging from 17%-37%), which can be offputting to some, & also you have to understand evolution to an extent. I assume most folks can sort'of see that if you really optimized for evolution, you'd do a lot more than we are to pass on genes; however, optimizing for "evolution" is underconstrained & can have arguments for "well we're actually still doing quite good by evolution's sake".

Now instead let's focus on optimizing for the human reward system. People believe in very addictive drugs & can see the effects. It's pretty easy to imagine "a drug addict becomes extremely powerful, and you try to stop them. What goes wrong?". It's also quite coherent what optimizing your reward system looks like!

The evolution analogy is still good under the inner-outer alignment frame, since humans would be evolution's seat & it seems difficult to avoid the same issues. Whereas the human reward system seems easier (eg give the AI fentanyl). This can be worked around by discussing how hard it is to design the perfect reward function which doesn't end up goodharting.

My Empathy Is Rarely Kind

Logan Riggs2mo31

I do take a hard deterministic stance, so I'd like to hear your thoughts here. Do you agree w/ the following?

People literally can't make different choices due to determinism
Laws & punishments are still useful for setting the right incentives that lead to better outcomes
You're allowed to have negative emotions given other people's actions (see #1), but those emotions don't necessarily lead to better outcomes or incentives

I remember being 9 years old & being sad that my friend wasn't going to heaven. I even thought "If I was born exactly like them, I would've made all the same choices & had the same experiences, and not believe in God". I still think that if I'm 100% someone else, then I would end up exactly as they are.

I think the counterfactual you're employing (correct me if wrong) is "if my brain was in their body, then I wouldn't..." or "if I had their resources, then I wouldn't...", which is saying you're only [80]% that person. You're leaving out a part of them that made them who they are.

Now, you could still argue #2, that these negative emotions set correct incentives. I've only heard second-hand of extreme situations where that worked ^[1], but most of the time backfires

Son calls their parent after a while "Oh son, you never call! Shame shame"
Child says their sorry, but the parent demands them to show/feel remorse or it doesn't count.
Guilt tripping in general, lol

What do you think?

^{^}
One of my teacher's I still talk to pushed a student against the wall, yelling at them that they're wasting their life w/ drugs/etc, fully expecting to get fired afterwards. They didn't get fired & the student cleaned up (I believe this was in the late 90's though)

Spectral radii dimensionality reduction computed without gradient calculations

Logan Riggs4mo1-1

That does clarify a lot of things for me, thanks!

Looking at your posts, there’s no hooks or trying to sell your work, which is a shame cause LSRDR’s seem useful. Since they are you useful, you should be able to show it.

For example, you trained an LSRDR for text embedding, which you could show at the beginning of the post. Then showing the cool properties of pseudo-determinism & lack of noise compared to NN’s. THEN all the maths. So the math folks know if the post is worth their time, and the non-math folks can upvote and share with their mathy friends.

I am assuming that you care about [engagement, useful feedback, connections to other work, possible collaborators] here. If not, then sorry for the unwanted advice!

I’m still a little fuzzy on your work, but possible related papers that come to mind are on tensor networks.

Compositionality Unlocks Deep Interpretable Models - they efficiently train tensor networks on [harder MNIST], showing approximately equivalent loss to NN’s, and show the inherent interpretability in their model.
Tensorization is [Cool essentially] - https://arxiv.org/pdf/2505.20132 - mostly a position and theoretical paper arguing why tensorization is great and what limitations.
Im pretty sure both sets of authors here read LW as well.

Spectral radii dimensionality reduction computed without gradient calculations

Logan Riggs4mo2-1

Is the LSRDR a proposed alternative to NN’s in general?

What interpretability do you gain from it?

Could you show a comparison between a transformer embedding and your method with both performance and interpretability? Even MNIST would be useful.

Also, I found it very difficult to understand your post (Eg you didn’t explain your acronym! I had to infer it). You can use the “request feedback” feature on LW in the future; they typically give feedback quite quickly.

Winning the power to lose

Logan Riggs4mo30

Gut reaction is “nope!”.

Could you spell out the implication?

Winning the power to lose

Logan Riggs4mo30

Correct! I did mean to communicate that in the first footnote. I agree value-ing the unborn would drastically lower the amount of acceptable risk reduction.

Winning the power to lose

Logan Riggs4mo50

I agree w/ your general point, but think your specific example isn't considering the counterfactual. The possible choices aren't usually:

A. 50/50% chance of death/utopia
B. 100% of normal life

If a terminally ill patient would die next year 100%, then choice (A) makes sense! Most people aren't terminally ill patients though. In expectation, 1% of the people you know will die every year (w/ skewing towards older people). So a 50% of death vs utopia shouldn't be preferred by most people, & they should accept a delay of 1 year of utopia for >1% reduction in x-risk.^[1]

I can imagine someone's [husband] being terminally ill & they're willing to roll the dice; however, most people have loved ones that are younger (e.g. (great)-children, nephews/nieces, siblings, etc) which would require them to value their [husband] vastly greater than everyone else.^[2]

^{^}
However if normal life is net-negative, then either death or utopia would be preferred, changing the decision. This is also a minority though.
^{^}
However, folks could be short-sighted. Thinking to minimize the suffering of their loved one in front of them, w/o considering the negative effects of their other loved ones. This isn't utility function relevant, just a better understanding of the situation.

Winning the power to lose

Logan Riggs4mo135

AFAIK, I have similar values^[1] but lean differently.

~1% of the world dies every year. If we accelerate AGI sooner 1 year, we save 1%. Push back 1 year, lose 1%. So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.

This means you're P(doom) given our current trajectory very much matters. If you're P(doom) is <1%, then pushing back a year isn't worth it.

The expected change conditioning on accelerating also matters. If accelerating by 1 year increases e.g. global tensions, increasing a war between nuclear states by X% w/ an expected Y-deaths (I could see arguments either way though, haven't thought too hard about this).

For me, I'm at ~10% P(doom). Whether I'd accept a proposed slowdown depends on how much I expect it decrease this number.^[2]

How do you model this situation? (also curious on your numbers)

Assumptions:

We care about currently living people equally (alternatively, if you cared mostly about your young children, you'd happily accept a reduction in x-risk of 0.1% (possibly even 0.02%). Actuary table here)
Using expected value, which only mostly matches my intuitions (e.g. I'd actually accept pushing back 2 years for a reduction of x-risk from 1% to ~0%)

^{^}
I mostly care about people I know, some for people in general, and the cosmic endownment would be nice, sure, but only 10% of the value for me.
^{^}
Most of my (currently living) loved ones skew younger, ~0.5% expected death-rate, so I'd accept a lower expected reduction in x-risk (maybe 0.7%)

SAE vs. RepE

Logan Riggs4mo60

"focus should no longer be put into SAEs...?"

I think we should still invest research into them BUT it depends on the research.

Less interesting research:

1. Applying SAEs to [X-model/field] (or Y-problem w/o any baselines)

More interesting research:

Problems w/ SAEs & possible solutions
1. Feature supression (solved by post-training, gated-SAEs, & top-k)
2. Feature absorption (possibly solved by Matryoshka SAEs)
3. SAE's don't find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
4. Dark-matter of SAEs (nothing AFAIK)
5. Many more I'm likely forgetting/haven't read
Comparing SAEs w/ strong baselines for solving specific problems
Using SAEs to test how true the linear representation hypothesis is
Changing SAE architecture to match the data

In general, I'm still excited about an unsupervised method that finds all the model's features/functions. SAE's are one possible option, but others are being worked on! (APD & L3D for a weight-based method)

Relatedly, I'm also excited about interpretable-from-scratch architectures that do lend themselves more towards mech-interp (or bottom-up in Dan's language).

SAE vs. RepE

Logan Riggs4mo30

Just on the Dallas example, look at this +8x & -2x below

So they 8x all features in the China super-node and multiplied the Texas supernode (Texas is "under" China, meaning it's being "replaced") by -2x. That's really weird! It should be multiplying Texas node by 0. If Texas is upweighting "Austin", then -2x-ing it could be downweighting "Austin", leading to cleaner top outputs results. Notice how all the graphs have different numbers for upweighting & downweighting (which is good that they include that scalar in the images). This means the SAE latents didn't cleanly separate the features (we think exist).

(With that said, in their paper itself, they're very careful & don't overclaim what their work shows; I believe it's a great paper overall!)

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments