https://mesaoptimizer.com
I expect this to produce far more capabilities relevant insights than alignment relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
I assume here you mean something like: given how most MI (mechanistic interpretability) projects seem to be done, the most likely output of all these projects will be concrete interventions that make it easier for a model to become more capable, and these interventions will have little to no effect on making it easier for us to direct a model towards having the 'values' we want it to have.
I agree with this claim: capabilities generalize very easily, while 'alignment generalization' in the way we intend seems extremely unlikely by default. So the most likely outcome of more MI research does seem to be interventions that remove obstacles on the path to AGI, while not actually making progress on 'alignment generalization'.
I see -- you are implying that an AI model will leverage external system parts to augment itself. For example, a neural network would use an external scratch-pad as a different form of memory for itself. Or instantiate a clone of itself to do a certain task for it. Or perhaps use some sort of scaffolding.
I think these concerns probably don't matter for an AGI, because I expect that data transfer latency would be a non-trivial blocker for storing data outside the model itself, and it is more efficient to self-modify and improve one's own intelligence than to use some form of 'factored cognition'. Perhaps these things are issues for an ostensibly boxed AGI, and if that is the case, then this makes a lot of sense.
I doubt Nate Soares would advocate “overriding” per se
Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make the two extremes explicit. Perhaps using the name "Nate-style caring" is a bad idea.
(I now think that "System 1 caring" and "System 2 caring" would have been much better.)
a general idea of “optimizing hard” means higher risk of damage caused by errors in detail
Agreed.
“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective
I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In the model of caring I am trying to make more legible, Carlsmith-style caring may be more robust to certain epistemological errors humans can make, errors that can result in severely sub-optimal scenarios, because it is constrained by human cognition and capabilities.
Note: I notice that this can also be said of Soares-style caring -- both are constrained by human cognition and capabilities, but in different ways. Perhaps the two have different failure modes, and each is more effective over certain distributions (which may diverge)?
I've noticed that there are two major "strategies of caring" used in our sphere:
Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soares-style caring (which in essence amounts to "shut up and multiply" plus consequentialism), combined with inattention or an inaccurate map of the world (that is, broken epistemics), can lead to severely sub-optimal decisions.
The harder you optimize for a goal, the better your epistemology (and by extension, your understanding of your goal and the world) needs to be. Carlsmith-style caring seems more effective because it is very likely more robust to bad epistemology than Soares-style caring.
(There are more pieces necessary to make Carlsmith-style caring viable, and a lot of them can be found in Soares' "Replacing Guilt" series.)
I did not expect what appears to me to be a non-superficial combination of concepts behind the input prompt and the mixing/steering prompt -- this has made me more optimistic about the potential of activation engineering. Thank you!
Partition (after which block activations are added)
Does this mean you added the activation additions once to the output of the previous layer (and therefore into the residual stream)? My first-token interpretation was that you added them repeatedly to the output of every subsequent block, which seems unlikely.
Also, could you explain the intuition / reasoning behind applying activation additions only to encoder layers instead of decoder layers? Given that GPT-4 and GPT-2-XL are decoder-only models, I expect that testing activation additions on decoder layers would have been more relevant.
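For concreteness, here is a minimal sketch of the interpretation I am asking about, assuming the Hugging Face `transformers` GPT-2 implementation (the `layer_idx`, `alpha`, and contrast prompts below are illustrative placeholders, not values from your post): the steering vector is added once, into the residual stream at the output of a single block, rather than at every subsequent block.

```python
# Illustrative sketch: derive a steering vector from a contrast pair and add it
# to the residual stream at the output of ONE block via a forward hook.
# layer_idx, alpha, and the prompts are assumptions made for this example.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

layer_idx = 6   # block whose output (residual stream) receives the addition
alpha = 4.0     # steering coefficient

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activations at the output of block `layer_idx`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer_idx + 1]  # hidden_states[0] is the embedding output

# Steering vector: difference of the contrast pair's last-token activations.
steering = alpha * (
    residual_at_layer(" Love")[:, -1, :] - residual_at_layer(" Hate")[:, -1, :]
)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual stream.
    hidden = output[0] + steering  # broadcast over all positions
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    prompt_ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
    out = model.generate(prompt_ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()
```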
It would be lovely if you could also support some form of structured export, so that people can use this tool knowing they can export their data and switch to another tool (if this one gets Googled) anytime.
But yes, I am really excited for a super-fast, easy-to-use, and good-looking PredictionBook successor. Manifold Markets was just intimidating for me, and the only reason I got into it was social motivation. This tool serves a more personal niche for prediction logging, I think, and that is good.
running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead
My implication was that the quoted claim of yours was extreme and very likely incorrect ("we're all dead" and "unless this insanity is stopped", for example). I guess I failed to make that clear in my reply -- perhaps LW comment norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.
Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research.
Can you give concrete use-cases where you imagine your project helping alignment researchers? Alignment researchers have wildly varying styles of work outputs and processes. I assume you aim to accelerate a specific subset of alignment researchers (those who focus on interpretability and existing models, and who have an incremental / empirical strategy for solving the alignment problem).
The supposed computations downstream of "chaotic behavior" do not seem to me to be load-bearing for systems being able to do non-trivially influential things in the real world.
Humans are not "indeterministic". Humans are deterministic computations that follow the laws of physics, and are not free from the laws of reality that constrain them.
These "not well-specified tasks" are tasks where we fill in the blanks based on our knowledge and experience of what makes sense "in distribution". This is not at all hard to do with an AI -- GPT-4 is a good real-world example of an AI capable of parsing what you call "not well-specified tasks", even if its actions do not seem to result in "superhuman" outcomes.
I recommend reading the Sequences. It should help reduce some of the confusion inherent to how you think about these topics.