tmeanen

Comments

tmeanen

But if the core difficulty in solving alignment is developing some difficult mathematical formalism and figuring out relevant proofs then I think we won't suffer from the problems with the technologies above. In other words, I would feel comfortable delegating and overseeing a team of AIs that have been tasked with solving the Riemann hypothesis - and I think this is what a large part of solving alignment might look like.

I've been in a number of arguments where people say things like "why is 90% doom such a strong claim? That assumes that survival is the default! "

Am I misunderstanding this sentence? How do "90% doom" and the assumption that survival is the default square with one another?

“keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse”

But I think the negative impacts that these goods have on you are (mostly) realized on longer timescales, say, years to decades. If you’re using a chair that is bad for your posture, the impact usually shows up years down the line when your back starts aching. Or if you keep microwaving tupperware, you may end up with some pretty nasty medical problems, but again, decades down the line.

The property of an action having a long horizon before it can be verified as good or bad for you is what makes delegating to smarter-than-you systems dangerous. My intuition is that there are lots of tasks that could significantly accelerate alignment research that don’t have this property, examples being codebase writing (unit tests can provide quick feedback), proof verification, etc. In fact, I can’t think of many research tasks in technical fields with month-, year-, or decade-long horizons before they can be verified, though maybe I just haven’t given it enough thought.
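To make the short-verification-horizon point concrete, here is a minimal sketch (the function and checker names are hypothetical, invented for illustration): if a delegated AI writes `merge_sorted`, a randomized unit-test-style check can tell you within milliseconds whether the work was good, rather than years later.

```python
import random

def merge_sorted(a, b):
    """Merge two already-sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

def quick_verify(fn, trials=100):
    """Randomized check against a trusted reference (sorted).

    Feedback arrives in milliseconds, illustrating a task with a
    short verification horizon, unlike 'is this chair good for my
    posture?', which takes years to resolve.
    """
    for _ in range(trials):
        a = sorted(random.sample(range(1000), random.randint(0, 20)))
        b = sorted(random.sample(range(1000), random.randint(0, 20)))
        if fn(a, b) != sorted(a + b):
            return False
    return True

print(quick_verify(merge_sorted))  # True
```

The asymmetry the comment points at is exactly this: the checker is far simpler than the artifact it validates, so a human overseer can trust the verdict without redoing the smarter system's work.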

Plausibly, one technology that arrives soon after superintelligence is powerful surveillance technology that makes enforcing commitments significantly easier than it has been historically. Leaving aside the potential for misuse by authoritarian governments, advocating for this to be developed before powerful technologies of mass destruction may be a viable strategy.

tmeanen

Nice, I like this concept of rogue deployment, as it highlights two distinct features that are both required for a safety method to be considered 'successful'. I understand catastrophe with rogue deployment as a case where the safety measures were good enough but were bypassed or turned off, whereas catastrophe without rogue deployment involves safety measures that were fully operational the whole time but insufficient to prevent a model or human actor from causing a catastrophe.

So, for example, we could get really great mech. interp tools, but avoiding catastrophe isn't guaranteed if all of these tools run on a single server (making them very easy to disable). To prevent rogue deployment, we’d want multiple servers running these mech. interp tools, providing redundancy in case one goes down or gets hacked. So there's a concept here of the raw effectiveness of a safety method as well as its reliability. I'm sure others can think of more nuanced examples too.