All Posts


Friday, June 16th 2023

Shortform
12 · nim · 10h
Reading https://www.lesswrong.com/posts/nwJCzszw8gGjPTihM/i-still-think-it-s-very-unlikely-we-re-observing-alien and pondering the Bigfoot thing.

On the one hand, We Have Cameras Everywhere(TM). On the other hand: pick any area of the Pacific Northwest and look at a map of where the permanent roads are. Pull it up side by side with a map of an area you're familiar with. Zoom in on both to a magnification you'd consider reasonable for imagining things at walking-around scale. Pan around on the PNW map and try to find a permanent road. It'll take a minute.

Most land out here grows timber, sure. Timber is harvested roughly once every 30-50 years. At this point, I'd bet that every square mile of the area has been visited by humans. Forestry land is heavily trafficked once every few decades; conservation land is surveyed and studied and sometimes visited by tourists. The question, like a missing term in the Drake Equation, is when. The L term captures for-how-long, sure, but only implies a difference between "someone sent us radio signals for 100 years around 1000 AD" and "someone sent us radio signals for 100 years around 2000 AD".

I have two cats who hate me. (Not their fault; they came from an animal hoarding situation, so they're probably kind of traumatized.) They seem to think I'm noisy and conspicuous and I stink, and to their perceptions I certainly do. They despise being perceived. I can tell that they're in my house because I can check every nook and cranny and learn their favorite hidey-holes, and the food I put out for them gets eaten, and their litter boxes get full. But if this were out in the woods instead of the artificial and tightly controlled environment of my home, I would likely not know they're around, just like most hikers don't know when they're being watched by a mountain lion. The cats hate the places where I spend time, just as th
9 · Lauro Langosco · 4h
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

1. Ability to be deceptively aligned
2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF'd for
3. Incentives to break containment exist in a way that is accessible / understandable to the model
4. Ability to break containment
5. Ability to robustly understand human intent
6. Situational awareness
7. Coherence / robustly pursuing its goal in a diverse set of circumstances
8. Interpretability methods break (or other oversight methods break)
   * This doesn't have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you'd expect
9. Capable enough to help us exit the acute risk period

Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to the other thresholds, the easier most alignment proposals are. Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of 'full alignment' (CEV-style), any alignment method (e.g. corrigibility) only works within a specific range of capabilities:

* Too much capability breaks alignment, e.g. because a model self-reflects and sees all the ways in which its objectives conflict with human goals.
* Too little capability (or too little 'coherence') and any alignment method will be non-robust to OOD inputs or even small improvements in capability or self-reflectiveness.
8 · johnswentworth · 1d
Consider two claims:

* Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model.
* Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility.

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.

I expect that many people's intuitive mental models around utility maximization boil down to "boo utility maximizer models", and they would therefore intuitively expect both of the above claims to be true at first glance. But on examination the probable incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.
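The first claim can be made concrete with a toy construction (a sketch of my own, not from the post; all names are illustrative): given any deterministic policy, define a utility function that assigns 1 to whatever the policy does and 0 to everything else. The policy is then trivially the argmax of that utility function, which is why "can be modeled as a utility maximizer" carries so little content on its own.

```python
def rationalize(policy):
    """Build a utility function that the given policy maximizes:
    u(state, action) = 1 iff the action is what the policy would do."""
    def utility(state, action):
        return 1 if action == policy(state) else 0
    return utility

# Illustrative "corrigible" policy: always defer to a shutdown request.
def corrigible_policy(state):
    return "shut_down" if state == "shutdown_requested" else "continue_task"

actions = ["shut_down", "continue_task"]
u = rationalize(corrigible_policy)

# The corrigible policy is exactly the argmax of this contrived utility:
for state in ["shutdown_requested", "normal_operation"]:
    best = max(actions, key=lambda a: u(state, a))
    assert best == corrigible_policy(state)
```

Of course, this trivial rationalization is exactly the kind of degenerate "utility maximizer model" the first claim complains about; the tension with the second claim arises because that claim needs a more restrictive notion of utility maximization to be substantive.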
4 · kuira · 1d
Sometimes I have an internal desire to do something different than what I think should be done (for example, I might desire to play a game while also thinking the better choice is to read). I've been experimenting with using randomness to mediate this: I keep a D20 with me, give each side of the dispute odds proportional to the strength of its resolve, and then roll the die. In theory, this means neither side will overpower the other, and even a small resolve still has a chance. I'm not sure how useful this is, but it's fun, and it can give me some motivation (I've tried to internalize this kind of roll as a rule not to break without good reason). Also, when I'm merely deciding between some options, I'll sometimes roll more casually with equal odds, and it'll help me realize that I already knew which option I really wanted (if I don't like the roll's outcome).
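The D20 procedure above amounts to weighted random choice, and can be sketched in a few lines (my own illustration of the idea; the function name and face allocations are made up for the example):

```python
import random

def resolve_dispute(options):
    """Roll a D20 to pick among competing desires, where each option
    claims a number of faces proportional to the strength of its resolve.
    `options` is a list of (name, faces) pairs whose faces sum to 20."""
    if sum(faces for _, faces in options) != 20:
        raise ValueError("face allocations must sum to 20")
    roll = random.randint(1, 20)
    cumulative = 0
    for name, faces in options:
        cumulative += faces
        if roll <= cumulative:
            return name, roll

# e.g. "read" gets 14 faces, "play a game" gets the remaining 6
choice, roll = resolve_dispute([("read", 14), ("play a game", 6)])
```

The "equal odds to reveal a hidden preference" variant is just the same call with faces split evenly; the useful information is your reaction to the roll, not the roll itself.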
3 · Dalcy Bremin · 15h
What's a good technical introduction to Decision Theory and Game Theory for alignment researchers? I'm guessing standard undergrad textbooks don't include, say, content about logical decision theory. I've mostly been reading posts on LW but as with most stuff here they feel more like self-contained blog posts (rather than textbooks that build on top of a common context) so I was wondering if there was anything like a canonical resource providing a unified technical / math-y perspective on the whole subject.