Reading
https://www.lesswrong.com/posts/nwJCzszw8gGjPTihM/i-still-think-it-s-very-unlikely-we-re-observing-alien
and pondering the Bigfoot thing.
On the one hand, We Have Cameras Everywhere(TM).
On the other hand -- pick any area of the pacific northwest and look at a map of
where the permanent roads are. Pull it up side by side with a map of an area
that you're familiar with. Zoom in on both, to a magnification you'd consider
reasonable for imagining things at walking-around scale. Pan around on the PNW
map and try to find a permanent road. It'll take a minute.
Most land out here grows timber, sure. Timber is harvested roughly once every
30-50 years.
At this point, I'd bet that every square mile of the area has been visited by
humans. Forestry land is heavily trafficked once every few decades; conservation
land is surveyed and studied and sometimes visited by tourists.
The question, like a missing term in the Drake Equation, is when. The L term
captures for-how-long, sure, but nothing distinguishes "someone sent us radio
signals for 100 years around 1000 AD" from "someone sent us radio signals for
100 years around 2000 AD".
I have two cats who hate me. (not their fault, they came from an animal hoarding
situation so they're probably kinda traumatized) They seem to think I'm noisy
and conspicuous and I stink, and to their perceptions I certainly do. They
despise being perceived. I can tell that they're in my house because I can check
every nook and cranny and learn their favorite hidey-holes, and the food I put
out for them gets eaten, and their litter boxes get full. But if this was out in
the woods instead of the artificial and tightly controlled environment of my
home, I would likely not know they're around, just like most hikers don't know
when they're being watched by a mountain lion. The cats hate the places where I
spend time.
Lauro Langosco
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of
rambly list of relevant thresholds:
1. Ability to be deceptively aligned
2. Ability to think / reflect about its goals enough that the model realises it
does not like what it is being RLHF'd for
3. Incentives to break containment exist in a way that is accessible /
understandable to the model
4. Ability to break containment
5. Ability to robustly understand human intent
6. Situational awareness
7. Coherence / robustly pursuing its goal in a diverse set of circumstances
8. Interpretability methods break (or other oversight methods break)
   (This doesn't have to be because of deceptiveness; maybe thoughts are just
   too complicated at some point, or in a different place than you'd expect.)
9. Capable enough to help us exit the acute risk period
Many alignment proposals rely on reaching these thresholds in a specific order.
For example, the earlier we reach (9) relative to other thresholds, the easier
most alignment proposals are.
Some of these thresholds are relevant to whether an AI or proto-AGI is alignable
even in principle. Short of 'full alignment' (CEV-style), any alignment method
(eg corrigibility) only works within a specific range of capabilities:
* Too much capability breaks alignment, e.g. because a model self-reflects and
sees all the ways in which its objectives conflict with human goals.
* Too little capability (or too little 'coherence') and any alignment method
will be non-robust with respect to OOD inputs or even small improvements in
capability or self-reflectiveness.
johnswentworth
Consider two claims:
* Any system can be modeled as maximizing some utility function, therefore
utility maximization is not a very useful model
* Corrigibility is possible, but utility maximization is incompatible with
corrigibility, therefore we need some non-utility-maximizer kind of agent to
achieve corrigibility
These two claims should probably not both be true! If any system can be modeled
as maximizing a utility function, and it is possible to build a corrigible
system, then naively the corrigible system can be modeled as maximizing a
utility function.
I expect that many people's intuitive mental models around utility maximization
boil down to "boo utility maximizer models", and they would therefore
intuitively expect both the above claims to be true at first glance. But on
examination, the probable-incompatibility is fairly obvious, so the two claims
might make a useful test to notice when one is relying on yay/boo reasoning
about utilities in an incoherent way.
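The first claim's standard justification can be sketched concretely (my illustration, not from the comment): given any policy at all, define a utility function that scores 1 exactly on the actions the policy takes, and the policy trivially maximizes it. Names here are hypothetical.

```python
# Trivial construction: any behavior maximizes SOME utility function.

def make_trivial_utility(policy):
    """Return a utility u(state, action) that the given policy maximizes."""
    return lambda state, action: 1.0 if action == policy(state) else 0.0

# Any behavior at all -- e.g. a "corrigible" policy that defers to shutdown:
def corrigible_policy(state):
    return "shutdown" if state.get("human_wants_shutdown") else "act"

u = make_trivial_utility(corrigible_policy)

actions = ["act", "shutdown"]
state = {"human_wants_shutdown": True}
best = max(actions, key=lambda a: u(state, a))
assert best == corrigible_policy(state)  # the policy is a utility maximizer
```

Of course, this utility function is vacuous (it encodes no preferences beyond "do what you were going to do"), which is exactly why the modeling claim, if true, undercuts the incompatibility claim.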
kuira
Sometimes I have an internal desire to do something different from what I think
should be done (for example, I might desire to play a game while also thinking
the better choice is to read). I've been experimenting with using
randomness to mediate this. I keep a D20 with me, give each side of the dispute
some odds proportional to the strength of its resolve, and then roll the die.
In theory, this means neither side will overpower the other, and even a small
resolve still has a chance. I'm not sure how useful this is, but it's fun, and
can sort of give me motivation (I've tried to internalize this kind of roll as a
rule not to break without good reason).
Also, when I'm merely deciding between some options, sometimes I'll roll more
casually with equal odds, and the roll helps me realize that I already knew
which option I really wanted (if I don't like the outcome).
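The D20 procedure above can be sketched as a few lines of code (a hypothetical illustration; the function name and weighting are my own):

```python
import random

def mediate(options):
    """Roll a D20 to settle an internal dispute.

    options: list of (name, faces) pairs, where faces is the number of
    die faces allotted to that side, proportional to its resolve.
    All 20 faces must be allocated.
    """
    assert sum(faces for _, faces in options) == 20, "allocate all 20 faces"
    roll = random.randint(1, 20)  # the physical die roll
    for name, faces in options:
        if roll <= faces:
            return name
        roll -= faces

# e.g. "read" has three times the resolve of "play a game":
choice = mediate([("read", 15), ("play", 5)])
```

Each side wins with probability proportional to its allotted faces, so even a weak resolve (one face, a 5% chance) is never shut out entirely.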
Dalcy Bremin
What's a good technical introduction to Decision Theory and Game Theory for
alignment researchers? I'm guessing standard undergrad textbooks don't include,
say, content about logical decision theory. I've mostly been reading posts on LW
but as with most stuff here they feel more like self-contained blog posts
(rather than textbooks that build on top of a common context) so I was wondering
if there was anything like a canonical resource providing a unified technical /
math-y perspective on the whole subject.