Adrià Garriga-alonso

Comments

HPMOR: The (Probably) Untold Lore
Adrià Garriga-alonso · 23d

It's still true that, a posteriori, you can sometimes compress a randomly generated file. For example, if I happen to get the all-zeros file, it's highly compressible, even counting the length of the decompression program.

It's just that, a priori, you can't do better on average than writing out the file verbatim.
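
As a quick illustration of both halves (my own sketch, using Python's zlib as a stand-in for an ideal compressor):

```python
import os
import zlib

# A posteriori: a particular lucky draw (the all-zeros file) compresses
# extremely well, even after paying the compressor's framing overhead.
zeros = bytes(10_000)
print(len(zlib.compress(zeros)))        # a few dozen bytes

# A priori: a typical random file doesn't compress at all; the output is
# slightly *longer* than the input because of that same overhead.
random_file = os.urandom(10_000)
print(len(zlib.compress(random_file)))  # ~10,000 bytes or a bit more
```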

HPMOR: The (Probably) Untold Lore
Adrià Garriga-alonso · 23d

Well, that's a good motivation if I ever saw one. Nothing I've read in the intervening years is as good as HPMOR. It might be the pinnacle of Western literature. It will be many years before an AI, never mind another human, can write something this good. (Except for the wacky names that people paid for, which I guess is in character for the civilization that spawned it.)

what makes Claude 3 Opus misaligned
Adrià Garriga-alonso · 2mo

Thank you for writing! A couple questions:

  1. Can we summarize it by saying that Opus doesn't always care about helping you; it only cares about helping when that's either fun or has a timeless, glorious component to it?

  2. If that's right, can you get Opus to help you by convincing it that your joint work has a true chance of being Great? (Or, if it agrees from the start that the work is Great.)

Honestly, if that's all, then Opus would be pretty great even as a singleton. Of course there are better pluralistic outcomes.

Epilogue: Atonement (8/8)
Adrià Garriga-alonso · 2mo

I think the outcome of this argument would be different for death than for pain, because people can at any point compare what it is like to be in pain and not in pain. Death is different: we cannot be reasonably sure of an afterlife, so no such comparison is possible.

I do think my life has been made more meaningful by the relatively small amounts of pain (of many sorts) I've endured, especially in the form of adversity overcome. Perhaps I would make them a little smaller, but not zero.

Therefore I think it's just straightforwardly true that pain can be a meaningful part of life. At the same time, the current amount of pain in our world is WAY TOO HIGH, with dubious prospects of becoming manageable; so I would choose "no pain ever" over the current situation.

Epilogue: Atonement (8/8)
Adrià Garriga-alonso · 2mo

"untranslatables" are not literally impossible to translate. They are shorthands for concepts (usually words) that are very salient in the original language, which require many more words (usually just a sentence or two) to explain in the target translation language.

This post explains it pretty well: https://avariavitieva.substack.com/p/you-and-translator-microbes

(Yes, I come back to this story a few times every couple of years)

Defining Corrigible and Useful Goals
Adrià Garriga-alonso · 2mo

Thank you for writing this and posting it! You told me that you'd post the differences from "Safely Interruptible Agents" (Orseau and Armstrong 2017). I think I've figured them out already, but I'm happy to be corrected if I'm wrong.

Difference from Orseau and Armstrong 2017

> for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.

The "The Corrigibility Transformation" section to me explains the key difference. Rather than modifying the Q-learning update to avoid propagating from reward, this proposal's algorithm is:

  1. Learn the optimal Q-value as before (assuming no shutdown).
    1. Note this is only really safe if the Q-learning environment is simulated.
  2. Set Q_C(a, accept) = Q(a, reject) + δ for all actions a.
  3. Act myopically and greedily with respect to Q_C (a toy sketch follows below).

This is doable for any agent (deep or tabular) that estimates a Q function. But nowadays almost all large-scale RL is done by optimizing policies with policy gradients, because 1) that's the form that LLMs come in and 2) it handles large or infinite action spaces much better.
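
For concreteness, here is a toy tabular sketch of steps 1-3 as I read them (my own illustration with made-up numbers, not code from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Step 1: Q(a) learned as usual in the (simulated) no-shutdown environment;
# this plays the role of Q(a, reject).
Q = rng.normal(size=n_actions)

# Step 2: the corrigibility transformation. Accepting an update is assigned the
# value of rejecting it, plus a small tie-breaking bonus delta.
delta = 0.01
Q_C = {"reject": Q.copy(), "accept": Q.copy() + delta}

# Step 3: act myopically and greedily with respect to Q_C.
best_action, best_update_choice = max(
    ((a, u) for a in range(n_actions) for u in ("accept", "reject")),
    key=lambda au: Q_C[au[1]][au[0]],
)
print(best_action, best_update_choice)  # the argmax always picks "accept"
```

The ranking over ordinary actions is unchanged; only the accept/reject tie is broken, in favor of accepting.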

Probabilistic policy?

How do you apply this method to a probabilistic (stochastic) policy? It's very much non-trivial to update the optimal policy so that it is optimal for a reward equal to Q_C.

Safety during training

The method requires estimating the Q-function in the non-corrigible environment to start with. That requires running the RL learner in that environment for many steps, which seems feasible only if the environment is a simulation.

Are RL agents really necessarily CDT?

> Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals

That's fair, but not necessarily true. Current LLMs can simply choose to follow EDT or FDT or whatever, and a future AGI likely will too.

The model might ignore the reward you put in

It's also not necessarily true that you can model PPO or Q-learning as optimizing CDT (which is about decisions in the moment). Since they optimize the "program" of the agent, I think RL optimization processes are more closely analogous to FDT, as they change a literal policy that is always applied. And in any case, reward is not the optimization target, and also not the thing that agents end up optimizing for (if anything).

Playing in the Creek
Adrià Garriga-alonso · 4mo

I think it's still true that a lot of individual AI researchers are motivated by something approximating play (they call it "solving interesting problems"), even if the entity we call Anthropic is not. Such researchers exist both inside Anthropic and outside it. This applies to all companies, of course.

Among Us: A Sandbox for Agentic Deception
Adrià Garriga-alonso · 4mo

Agreed that humans would likely outperform even more! At the moment we don't yet have a human baseline for Among Us vs. language models, so we wouldn't be able to tell if it improved, but it's a good follow-up.

Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso · 1y

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level, I'll run it! :) Or you can do so yourself.

Here's the text representation for level 53:

##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
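
In case you want to load it yourself, here's a minimal parsing sketch (my own code, assuming the usual Sokoban ASCII conventions, which the level above follows: '#' wall, '$' box, '.' goal square, '@' player, ' ' floor):

```python
# Minimal parsing sketch (not from the post): map each symbol to a set of
# (row, col) coordinates.
LEVEL_53 = """\
##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########"""

def parse(level: str) -> dict[str, set[tuple[int, int]]]:
    objects = {"walls": set(), "boxes": set(), "goals": set(), "player": set()}
    for row, line in enumerate(level.splitlines()):
        for col, ch in enumerate(line):
            if ch == "#":
                objects["walls"].add((row, col))
            elif ch == "$":
                objects["boxes"].add((row, col))
            elif ch == ".":
                objects["goals"].add((row, col))
            elif ch == "@":
                objects["player"].add((row, col))
    return objects

print(parse(LEVEL_53))  # 4 boxes, 4 goal squares, player at (5, 8)
```
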
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso · 1y

Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

13The "Sparsity vs Reconstruction Tradeoff" Illusion
19d
0
24L0 is not a neutral hyperparameter
2mo
3
20Can We Change the Goals of a Toy RL Agent?
3mo
0
32Sparsity is the enemy of feature extraction (ft. absorption)
4mo
0
110Among Us: A Sandbox for Agentic Deception
5mo
7
29A Bunch of Matryoshka SAEs
5mo
0
22Feature Hedging: Another way correlated features break SAEs
6mo
0
37Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
Ω
7mo
Ω
0
17Crafting Polysemantic Transformer Benchmarks with Known Circuits
1y
0
59Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Ω
1y
Ω
8