All of phone.spinning's Comments + Replies

The issue with early finetuning is that there’s not much that humans can actually select on, because the models aren’t capable enough - it’s really hard for me to say that one string of gibberish is better/worse.

That’s why I say as early as possible, and not right from the very start.

I think the issue with the more general “neocortex prosthesis” is that if AI safety/alignment researchers make this and start using it, every other AI capabilities person will also start using it.

Yup, that's a problem. The problem also exists with regard to an alignment assistant, although the problem is exacerbated here because "retargetable" is part of the specification. On the other hand, unlike the AI Assistant paradigm, a neocortex prosthesis need not be optimized to be user-friendly, and will probably have a respectable learning curve, which makes instant/universal adoption by others less likely. There are also other steps that could be taken to mitigate risks (e.g. siloing information). Second-order impacts are important to consider, but I also think it's productive to think separately about the problem of what systems would be the most useful to alignment researchers.
I'm not so sure about this, since GPT-3 came out in 2020 and very few people have used it to its potential (though that number will certainly grow with ease-of-use tools like ChatGPT). Your issue is far more likely if there is a publicly available demo than if a few alignment researchers are using it in private. That said, it's still very much something to be concerned and careful about.
  • unreasonable ^5



I think there's a typographical error - this doesn't link to any footnote for me, and there doesn't appear to be a fifth footnote at the end of the post

Geoffrey and others raised this general problem several years ago (e.g. here)

This link no longer works - I get a permission denied message.

This post is short, but important. The fact that we regularly receive enormously improbable evidence is relevant for a wide variety of areas. It's an integral part of having accurate beliefs, and despite this being such a key idea, it's underappreciated generally (I've only seen this post referenced once, and it's never come up in conversation with other rationalists). 

Has anyone thought about the best ways of intentionally inducing the most likely/worst kinds of misalignment in models, so we can test out alignment strategies on them? I think red teaming kinda fits this, but that’s more focused on eliciting bad behavior, instead of causing a more general misalignment. I’m thinking about something along the lines of “train with RLHF so the model reliably/robustly does bad things, and then we can try to fix that and make the model good/non-harmful”, especially in the sandwiching context where the model is more capable than...

Make it as dangerous as possible, to see if we can control it? cough Wuhan. cough

Is there any other reason to think that scalable oversight is possible at all in principle, other than the standard complexity theory analogy? I feel like this is forming the basis of a lot of our (and others’) work in safety, but I haven’t seen work that tries to understand/conceptualize this analogy concretely.

Is anyone thinking about how to scale up human feedback collection by several orders of magnitude? A lot of alignment proposals aren’t focused on the social choice theory questions, which I’m okay with, but I’m worried that there may be large constant factors in the scalability of human feedback strategies like amplification/debate, such that there could be big differences between collecting 50k trajectories versus, say, 50-500M. Obviously cost/logistics are a giant bottleneck here, but I’m wondering what the other big challenges might be (especially if we could make intellectual progress on this before we may need to).

How does shard theory differ from the Olah-style interpretability agenda? Why is there any reason to believe we can learn about "shards" without interpretability? 

When doing sandwiching experiments, a key property of your "amplification strategy" (i.e. the method you use to help the human complete the task) is that it should only help the person complete the task correctly.

For example, let's say you have a language model give arguments for why a certain answer to a question is correct. This is fine, but we don't want it to be the case that the system is also capable of convincing the person of an incorrect answer. In this example, you can easily evaluate this by prompting or finetuning the model to argue for incorrect an...

If people read at 250 words/minute, and a page of text has 500 words, and we could get 10M people to read AI outputs full-time, we could have human-evaluation of about 100B pages of text per year, which actually feels surprisingly high

This is about 100T tokens, assuming ~2 tokens per word. That's quite a lot of supervision.
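
The arithmetic in the estimate above can be checked with a quick sketch. The reading speed, page length, reader count, and tokens-per-word figures come from the two comments; the number of reading hours per year is not stated anywhere, so `HOURS_PER_YEAR` below is purely an assumption (roughly 40 hours a week for 50 weeks), and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the human-evaluation estimate.
WORDS_PER_MINUTE = 250      # from the comment
WORDS_PER_PAGE = 500        # from the comment
READERS = 10_000_000        # from the comment
TOKENS_PER_WORD = 2         # from the reply
HOURS_PER_YEAR = 2_000      # ASSUMPTION: not given in the original comments


def pages_per_year(hours_per_year: int = HOURS_PER_YEAR) -> int:
    """Pages all readers get through in a year at the stated reading speed."""
    words_per_reader = WORDS_PER_MINUTE * 60 * hours_per_year
    return words_per_reader * READERS // WORDS_PER_PAGE


def tokens_for_pages(pages: int) -> int:
    """Convert a page count into tokens using the stated ratios."""
    return pages * WORDS_PER_PAGE * TOKENS_PER_WORD


quoted_pages = 100_000_000_000           # "about 100B pages", as stated
print(tokens_for_pages(quoted_pages))    # 100000000000000, i.e. 100T tokens
print(pages_per_year())                  # 600000000000 under the assumed hours
```

The 100T token figure follows directly from the quoted 100B pages. Under the assumed 2,000 reading hours per year the page count comes out several times higher than 100B, so the quoted number looks like a conservative order-of-magnitude estimate rather than an overstatement.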

Has anyone tried debate experiments where the judge can interject/ask questions throughout the debate?

It’s so easy to get caught up in meta-thinking - I want to try to remember to not spend more than maybe 10% of my time generally doing meta-reflection, process optimization, etc., and spend at least 90% of my time working directly on the concrete goal in front of me (LM alignment research, right now)

Epistemic status: I’m somewhat confident this is a useful axis to describe/consider alignment strategies/perspectives, but I’m pretty uncertain which is better. I could be missing important considerations, or weighing the considerations listed inaccurately.

When thinking about technical alignment strategy, I’m unsure whether it’s better to focus on trying to align systems that already exist (and are misaligned), or to focus on procedures that train aligned models from the start.

The first case is harder, which means focusing on it is closer to minimax optimi...

SSL models trained on real observations can be thought of as maps, so that tuning them without keeping this point of view in mind risks changing the map in a way that is motivated by something other than aligning it with the territory. In particular, fine-tuning might distort the map, and amplification might generate sloppy fiction to be accepted as territory. A proper use of fine-tuning in this frame is as search for high fidelity depictions of aligned agents, zooming in the map by conditioning it to be about particular situations we are looking for. This is different from realigning the whole map with reinforcement learning, which might get it to lose touch with the ground truth of original training data. And a proper use of amplification is as reflection on blank spaces on the map, or low fidelity regions, extrapolating past/future details and other hidden variables of the territory from what the map does show, and learning how it looks when added to the map.

I would have really appreciated documentation on this, fwiw! 

Hmm yeah that’s fair, but I think what I said stands as a critique of a certain perspective on alignment, insofar as I think having the alignment curve grow faster at every step is equivalent to solving the core hard problem. I agree that we need to solve the core hard problem, but we need to delay fast takeoff until we are very confident that the problems are solved.

1 the gears to ascenscion 5mo
ah, yeah, arguing for the incrementalization of alignment - strongly agreed there!

'The goal of alignment research should be to get us into "alignment escape velocity", which is where the rate of alignment progress (which will largely come from AI as we progress) is fast enough to prevent doom for enough time to buy even more time.'


^ the above argument only works if you think that there will be a relatively slow takeoff. If there is a fast takeoff, the only way to buy more time is to delay that takeoff, because alignment won't scale as quickly as capabilities under a period of significant and rapid recursive self-improvement. 

4 the gears to ascenscion 5mo
that's not less true under fast takeoff - you still need the alignment curve to grow faster at every step

Alignment is a stabilizing force against fast takeoff, because the models will not want to train models that don't do what *they* want. So, the goals/values of the superintelligence we get after a takeoff might actually end up being the values of models that are just past the point of capability where they are able to align their successors. I'd expect these values to be different from the values of the initial model that started the recursive self-improvement process, because I don't expect that initial model to be capable of solving (or caring about) alignment enough, and because there may be competitive dynamics that cause ~human-level AI to train successors that are misaligned to it.

I like this idea and think it is worth exploring. It is not just about training new models; an AGI has to worry about misalignment with every self-modification and every interaction with the environment that changes it. Perhaps there are even ways to deter an AGI from self-improvement, by making misalignment more likely. Some caveats are:
  • AGI may not take alignment seriously. We already have plenty of examples of general intelligences who don't.
  • AGI can still increase its capabilities without training new models, e.g. by getting more compute.
  • If an AGI decides to solve alignment before significant self-improvement, it will very likely be overtaken by other humans or AGI who don't care as much about alignment.

I think AI of the capability level that you describe will either already have little need to exploit people, or will quickly train successors that wouldn’t benefit from this. I do think deception is a big issue, but I think the important parts of deception will be earlier in terms of AI capability than you describe.

Which suggests that if you're doing randomish exploration, you should try to shake things up and move in a bunch of dimensions at once rather than just moving along a single identified dimension.


If you can only do randomish exploration this sounds right, but I think this often isn't the right approach (not saying you would disagree with this, just pointing it out). When we change things along basis vectors, we're implicitly taking advantage of the fact that we have a built-in basis for the world (namely, our general world model). This lets us reason about things like causality, constraints, etc. since we already are parsing the world into a useful basis.