abramdemski

Sequences

Pointing at Normativity
Consequences of Logical Induction
Partial Agency
Alternate Alignment Ideas
Filtered Evidence, Filtered Arguments
CDT=EDT?
Embedded Agency
Hufflepuff Cynicism

Comments

Sorted by

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? 

No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.

As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic

  • Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
  • In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them) which means we should have agentic concerns about beliefs.
  • Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.

I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

I agree that Solomonoff’s epistemology is noncentral in the way you describe, but I don't think it impacts my points very much; replace Solomonoff with whatever epistemic theory you like. It was just a convenient example.

(Although I expect defenders of Solomonoff to expect the program bits to be meaningful; and I somewhat agree. It's just that the theory doesn't address the meaning there, instead treating programs more like black-box predictors.)

In my view, meaning is the property of being optimized to adhere to some map-territory relationship. However, this optimization itself must always occur within some model (it provides the map-territory relationship to optimize for). In the context of Solomonoff Induction, this may emerge from the incentive to predict, but it is not easy to reason about.

In some sense, reality isn't made of bits, propositions, or any such thing; it is of unknowable type. However, we always describe it via terms of some type (a language).

I'm no longer sure where the disagreement lies, if any, but I still feel like the original post overstates things.

How would you become confident that a UANFSI approach was NFSI?

I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true). 

I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.

I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.

Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.

It seems to me that the importance and interaction of these different types of power in the future depends a lot on our choices now, ie, what kind of future we shape. Hierarchies could get smashed in one way or another, making John's prediction correct, or we could engineer a future that evolves from the present more smoothly, in which case you'd be correct.

One thing I don't understand / don't agree with here is the move from propositions to models. It seems to me that models can be (and usually are) understood in terms of propositions.

For example, Solomonoff understands models as computer programs which generate predictions. However, computer programs are constructed out of bits, which can be understood as propositions. The bits are not very meaningful in isolation; the claim "program-bit number 37 is a 1" has almost no meaning in the absence of further information about the other program bits. However, this isn't much of an issue for the formalism.

Similarly, I expect that any attempt to formally model "models" can be broken down into propositions. EG, if someone claimed that humans understand the world in terms of systems of differential equations, this would still be well-facilitated by a concept of propositions (ie, the equations).

It seems to me like a convincing abandonment of propositions would have to be quite radical, abandoning the idea of formalism entirely. This is because you'd have to explain why your way of thinking about models is not amenable to a mathematical treatment (since math is commonly understood in terms of propositions).

So (a) I'm not convinced that thinking in terms of propositions makes it difficult to think in terms of models; (b) it seems to me that refusing to think in terms of propositions would make it difficult to think in terms of models.

"X is false" has to be modeled as something that is value 1 if and only if X is value 0, but continuously decreases in value as X continuously increases in value. The simplest formula is value(X is false) = 1-value(X). However, we can made "sharper" formulas which diminish in value more rapidly as X increases in value. Hartry Field constructs a hierarchy of such predicates which he calls "definitely false", "definitely definitely false", etc.

Proof systems for the logic should have the property that sentences are derivable only when they have value 1; so "X is false" or "X is definitely false" etc all share the property that they're only derivable when X has value zero.

Load More