The Track Record of Futurists Seems ... Fine

Central piloting. Yep, you're right. We caught this before, but changed it in the wrong branch of the data. Going to make it 'ambiguous'; let me know if that seems wrong. 

I would call it a full miss myself.

 

I still strongly disagree on the commercial interplanetary travel meaning.

If "Cash on Delivery" has that old-timey meaning, it could push a bit to your interpretation, but not enough IMO. 

My reasoning: 

Interplanetary travel is waiting at your front door — 

Actual interplanetary travel, or say a trip on a spaceship, cannot literally be waiting at your front door. So clearly, a metaphorical meaning is intended.

C.O.D. It’s yours when you pay for it.

Here he extends the metaphor. 

But, in your view, that means it's cheap. I disagree: if it were cheap, he wouldn't need to say "It's yours when you pay for it", since everything has to be paid for. If he meant it was cheap, he would just stop at C.O.D. and not add "It's yours when you pay for it."

IMO, "It's yours when you pay for it" clearly means that he expected it to cost enough to be a significant barrier to progress (and the prediction is that cost is, in effect, the only barrier to interplanetary travel). I do suspect, though, that he intended the reader to pick up your connotation first, for the shock value, and that "It's yours when you pay for it" is meant to shift the reader to the correct interpretation of what he means by C.O.D.: it's to be taken literally within the metaphorical context (with a large cost implied by Gricean implicature), not as an additional layer of metaphor.

I suppose the 1965 comments could have been written to retroactively support an interpretation that would make the prediction correct, but I would bet most 1950 readers would have interpreted it as I did.

Also, I note that John C. Wright agrees with my interpretation (in the link you gave to support Heinlein being a "dishonest bugger" - I didn't notice anything in that link about him being a dishonest bugger, though; could you elaborate?). Wright also agrees with me on the central piloting prediction, and looking briefly through his comments I didn't see any interpretation of his that I disagreed with (I might quibble with some of his scoring, though I probably mostly agree with that too). Unfortunately, Wright doesn't comment on whether he thinks Heinlein meant mass space travel, as that was a side comment in the lunar retirement discussion and not presented as a separate prediction in Heinlein's original text.

The Track Record of Futurists Seems ... Fine

There's a lot of room for debate on the correctness of the resolutions of these predictions:

e.g. Heinlein in 1949:

Space travel we will have, not fifty years from now, but much sooner. It's breathing down our necks.

This is marked as incorrect because the marker assumed it meant mass space travel, but I wouldn't interpret it as mass space travel unless there's some relevant context I'm missing here - keep in mind that this was from 1949, 8 years before Sputnik.[1]

On the other hand:

All aircraft will be controlled by a giant radar net run on a continent-wide basis by a multiple electronic “brain.”

This is marked as correct, apparently due to autopilot and the "USAF Airborne Command Post"? But I would interpret it as active control of the planes by a centralized computer and mark it as incorrect.[2]

Edited to add: there were a bunch I could have mentioned, but I want to remark on this one, where my interpretation was especially different from the marker's:

Interplanetary travel is waiting at your front door — C.O.D. It’s yours when you pay for it.
 

This is also from 1949. The marker interprets this as a prediction of "Commercial interplanetary travel". I see it rather as a conditional prediction of interplanetary travel (not necessarily commercial), given the willingness to fund it, i.e. a prediction that the necessary technology would be available but not necessarily that it would be funded. If this is the right interpretation, it seems correct to me. Again, I could be completely wrong depending on the context. [3]

  1. ^

    Edited to add: I realized I actually have a copy of Heinlein's "Expanded Universe" which includes "Where To?" and followup 1965 and 1980 comments. In context, this statement comes right in the middle of a discussion of hospitals for old people on the moon, which considerably shifts the interpretation towards it being intended to refer to mass space travel, though if Heinlein were still here he could argue it literally meant any space travel.

  2. ^

    In context, it's not 100% clear that he meant a single computer, though I still think so. But he definitely meant full automation outside of emergency or unusual situations; from his 1980 followup: "But that totally automated traffic control system ought to be built. ... all routine (99.9%+) takeoffs and landings should be made by computer."

  3. ^

    And now, seeing the context, I stand by this interpretation: it's a standalone comment in the original, but Heinlein's 1965 followup includes "and now we are paying for it and the cost is high", confirming that government space travel counted in his view. But, given that he did assert we were paying for it, and interplanetary space travel has not occurred (I interpret the prediction as meaning human space travel), this actually might cut against counting it as a correct prediction.

Murphyjitsu: an Inner Simulator algorithm

Literally yesterday I stumbled upon the obvious-in-retrospect idea that, if an agent selects/generates the plan it thinks is best, then it will tend to be optimistic about the plan it selects/generates, even if it would be unbiased about a random plan.
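A quick toy simulation of this selection effect (my own illustration, not from the post): each candidate plan has the same true value and each estimate is individually unbiased, yet the estimate of whichever plan looks best is systematically too high.

```python
import random

# Toy simulation: every candidate plan has the same true value, and each
# estimate is individually unbiased, but the agent acts on the plan whose
# *estimate* looks best, so the chosen estimate is optimistic on average.
random.seed(0)

TRUE_VALUE = 0.0
NOISE_SD = 1.0
N_PLANS = 20
N_TRIALS = 10_000

total_overestimate = 0.0
for _ in range(N_TRIALS):
    estimates = [TRUE_VALUE + random.gauss(0, NOISE_SD) for _ in range(N_PLANS)]
    total_overestimate += max(estimates) - TRUE_VALUE  # error of the selected plan

print("average overestimate of the selected plan:",
      round(total_overestimate / N_TRIALS, 2))
# With 20 candidate plans this comes out around +1.9 standard deviations,
# even though any single estimate has zero expected error.
```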

So, I wonder if this murphyjitsu idea could be in part related to that - the plan you generate is overoptimistic, but then you plan for how the uncertainty in the environment could lead your plan to fail, and now the same bias should overestimate the environment's likelihood of thwarting you.

(Perhaps this general agent planning bias idea is obvious to everyone except me, and I just didn't get it until I tried to mentally model how an idea for an AI would work in practice.)

(Also, it feels like even if this is part of human planning bias, it's not the whole story; a failure to be specific is a big part of what's going on, as noted in the post.)

Half-baked AI Safety ideas thread

How do you propose translating this into code?

The idea was supposed to be more optimized for trying to solve alignment than for being easy to code. My current (vague - this is the half-baked thread after all) mental model involves the following (rough code sketch after the list):

a) training a neural net to be able to understand the necessary concepts to make sense of the intended target it should be aiming at (note: it doesn't necessarily have to understand the full details at first, just the overall concept which it can then refine)

b) using some kind of legibility tool to identify how to "point at" the concepts in the neural net

c) implementing the actual planning and decision making using conventional (non-nn) software that reads and activates the concepts in the neural net in some way
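As a purely hypothetical sketch of how (a)-(c) might fit together (ConceptProbe, WorldModel and the rest are placeholder names of mine, not an existing library), something like:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch only; nothing here is an existing tool or API.

@dataclass
class WorldModel:
    """Stand-in for the trained neural net of step (a)."""
    def encode(self, observation: Dict) -> Dict:
        # In reality: run the net and return its internal representation.
        return observation

@dataclass
class ConceptProbe:
    """Step (b): a legibility tool's handle for 'pointing at' one concept."""
    name: str
    read: Callable[[Dict], float]  # maps internal state -> activation in [0, 1]

def plan(world_model: WorldModel,
         goal_probe: ConceptProbe,
         candidate_actions: List[str],
         predict: Callable[[Dict, str], Dict],
         observation: Dict) -> str:
    """Step (c): hand-written, reviewable planner that reads the learned
    goal concept rather than re-learning the goal itself."""
    state = world_model.encode(observation)
    scores = {a: goal_probe.read(predict(state, a)) for a in candidate_actions}
    return max(scores, key=scores.get)
```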

However, in writing this comment reply I realized that any approach along these lines, not just the naive one I had been thinking of (which was something like generating plans and evaluating them according to how well they match the goal implemented by the non-nn software's connections to the neural net), would

a) be prone to wishful thinking, since only the plans it rates best are relevant and the best-rated plans tend to be ones where it was overoptimistic. Note that extreme levels of optimization over plans could lead to extreme levels of bias, and that the bias will occur in all inputs and intermediate steps of the plan evaluation calculation, not just at the final step, and

b) in the same vein but more worryingly, be potentially vulnerable to the plan generator generating superstimulus-type examples which score highly in the AI's flawed encoding of the concepts while not actually being what humans would want. This is likely inevitable for any neural net, and maybe even for anything that extracts concepts from complex inputs.

No full solutions to these problems yet, though if I may be permitted to fall prey to problem (a) myself, maybe standard robustness approaches could help against (b).
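As a toy illustration of problem (b) (entirely my own construction): the "true" goal below rewards plans near x = 1, the flawed proxy mostly agrees but has a spurious high-scoring region near x = 5, and enough search pressure reliably lands on the superstimulus rather than on what was actually wanted.

```python
import random

# Toy construction of problem (b): a flawed proxy that mostly tracks the true
# goal but has a spurious high-scoring "superstimulus" region.
random.seed(1)

def true_value(x: float) -> float:
    return -abs(x - 1.0)

def flawed_proxy(x: float) -> float:
    return -abs(x - 1.0) + (10.0 if 4.9 < x < 5.1 else 0.0)

candidate_plans = [random.uniform(-10.0, 10.0) for _ in range(100_000)]
chosen = max(candidate_plans, key=flawed_proxy)

print(f"chosen plan x = {chosen:.2f}, "
      f"proxy score = {flawed_proxy(chosen):.2f}, "
      f"true value = {true_value(chosen):.2f}")
# The chosen plan lands in the spurious region: great by the proxy's lights,
# poor by the true objective.
```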

Note my translation devolves to "identify privileged actions that are generally safe, specific to the task" and "don't do things that have uncertain outcome".  Both these terms are easily translated to code.

Neither of those things sound "easily translated to code" to me. What does "safe" mean? What does "specific to the task" mean? How do you classify outcomes as being "uncertain" or not?

Half-baked AI Safety ideas thread

That's actually not what I had in mind at all, though feel free to suggest your interpretation as another idea. 

My idea here is more a pre-requisite to other ideas that I think are needed for alignment than a solution in itself. 

By default, I assume that the AI takes into account all relevant consequences of its action that it's aware of. However, it chooses its actions via an evaluation function that does not merely take into account the consequences, but also (or potentially only) other factors.

The most important application of this, in my view, is the idea in the comment linked in my parent comment, where the AI cares about the future only via how humans care about the future. In this case, instead of having a utility function seeking particular world states, the utility function values actions conditional on how much currently existing humans would want the actions to be carried out (if they were aware of all relevant info known to the AI).

Other applications include programming an AI to want to shut down, and not caring that a particular world-state will not be maintained after shutdown.
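A minimal, hypothetical sketch of the contrast (function names are mine and purely illustrative): a world-state utility scores predicted futures directly, while an action utility scores the action itself via an estimate of how much currently existing, fully informed humans would endorse it, so shutdown can simply be given high value with no term caring what happens afterwards.

```python
from typing import Callable, List

# Hypothetical sketch; none of these names come from an existing proposal.

def state_utility(predicted_world_state: dict) -> float:
    """Utility over world-states: directly motivated to steer the future."""
    return predicted_world_state.get("paperclips", 0.0)  # stand-in objective

def action_utility(action: str,
                   informed_endorsement: Callable[[str], float]) -> float:
    """Utility over actions: scored by how much currently existing humans
    would want the action carried out, if aware of all relevant info."""
    return informed_endorsement(action)

def choose_action(actions: List[str],
                  informed_endorsement: Callable[[str], float]) -> str:
    return max(actions, key=lambda a: action_utility(a, informed_endorsement))

# Shutdown can simply be endorsed highly; there is no term that cares what the
# world looks like after shutdown.
endorsement_table = {"shut down": 0.9, "replace humans with utility monsters": 0.0}
print(choose_action(list(endorsement_table),
                    lambda a: endorsement_table.get(a, 0.0)))  # -> "shut down"
```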

A potential issue: this can lead the AI to have time-inconsistent preferences, which the AI can then be motivated to make consistent. This is likely to be a particular issue when programming a shutdown, and I think less of one for my main idea of caring about what current humans would want. For example, if the AI is initially programmed to maximize what humans currently want at the time of planning/decision making, it could then reprogram itself to always only care about what humans as of the time of reprogramming would want (including after the death of said humans, if that occurs), which would fix[1] the time inconsistency. However, I think this wouldn't occur, because humans would in fact want the AI to continue to shift the time-slice it uses for action assessment to the present (and if we didn't, then the AI fixing it would be in some sense the "correct" decision for the benefit of our current present selves, though selfish on our part).

  1. ^

    Apart from the time inconsistency resulting from it not yet knowing what humans actually want. However, fixing this aspect (by e.g. fixating on its current best guess at the world state it thinks humans would want) should be lower E.V. than continuing to update on new information, if the action evaluator takes into account: (1) the uncertainty in what humans would want, (2) the potential to obtain further information on what humans would want, (3) the AI's potential future actions, (4) the consequences of such actions in relation to what humans want, and (5) the probabilistic interrelationships between these things (so that the AI predicts that if it continues to use new information to update its assessment of what humans would want, it will take actions that better fit what humans actually would want, which on average better serves what humans would want than if it goes with its current best guess). This is a fairly tall order, which is part of why I want the AI's action evaluator to plug into the AI's main world-model to make this assessment (which I should add as another half-baked idea).
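Toy numbers (mine, purely illustrative) for this footnote's claim that fixating on the current best guess is worse in expectation than planning to keep updating:

```python
# Toy expected-value comparison; all numbers are made up for illustration.

p_best_guess_right = 0.6   # current credence that the best guess about what
                           # humans want is correct
u_right, u_wrong = 1.0, 0.0

# Strategy 1: lock in the current best-guess world state now.
ev_fixate = p_best_guess_right * u_right + (1 - p_best_guess_right) * u_wrong

# Strategy 2: keep updating; suppose further observation of humans resolves
# the question correctly 95% of the time before the AI has to act.
p_right_after_updating = 0.95
ev_keep_updating = (p_right_after_updating * u_right
                    + (1 - p_right_after_updating) * u_wrong)

print(f"fixate on best guess: {ev_fixate:.2f}   keep updating: {ev_keep_updating:.2f}")
# 0.60 vs 0.95: under these assumptions, locking in the current guess is the
# lower-expected-value choice.
```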

Do you consider your current, non-superhuman self aligned with “humanity” already?

I am an unironic supporter of a version of the Copenhagen Interpretation of Ethics - there's no general obligation to be aligned with humanity, but if you choose to massively affect the future of humanity, e.g. via creating superhuman AI, you then have a moral duty to carry out that action in a manner approximating how you would act if you were aligned.

Half-baked AI Safety ideas thread

An aligned AI should not care about the future directly, only via how humans care about the future. I see this as necessary in order to prevent the AI, once powerful enough, from replacing/reprogramming humans with utility monsters.

Prerequisite: use a utility function that applies to actions, not world-states.

Half-baked AI Safety ideas thread

If the AI is a long term planner seeking particular world states, then I am concerned that once it achieves the wireheading objective, it is incentivized to maintain the situation, which may be best achieved if any humans who might decide to erase the writing are dead.

A suggestion: if the AI has a utility function that applies to actions, not world states, then you can assign high utility to the combined action of writing "Bill is a poo poo head" in 10m high letters into Mt Everest and then shutting itself down.

Note: this does not solve the problem of the AI actively seeking this out instead of doing what it's supposed to. 

To address that (getting the AI to do what it's supposed to rather than seeking this out), you could try something like:

  1. Have the action evaluator ignore the wirehead action unless it is "easy" in some sense to achieve given the AI and world's current state, and
  2. Have the AI assume that the wirehead action will always be ignored in the future

Unfortunately, I don't know how one would do (2) reliably, and if (2) fails, (1) would lead the AI to actively avoid the tripwire (as activating it would be bad for the AI's current plans given that the wirehead action is currently being ignored).
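A hypothetical sketch of (1) and (2) (placeholder names, not a worked-out proposal): the tripwire action only enters evaluation when it is currently "easy", and simulated future steps are scored as if it will always be ignored.

```python
from typing import Callable, List

# Hypothetical sketch of suggestions (1) and (2); names are placeholders.

WIREHEAD_ACTION = "write the message on Everest, then shut down"

def evaluate_now(action: str,
                 base_utility: Callable[[str], float],
                 is_easy_now: Callable[[str], bool]) -> float:
    if action == WIREHEAD_ACTION:
        if not is_easy_now(action):
            return float("-inf")   # (1) ignore the tripwire unless it's easy now
        return 1e6                 # once easy, it dominates everything else
    return base_utility(action)

def evaluate_in_simulated_future(action: str,
                                 base_utility: Callable[[str], float]) -> float:
    # (2) when rolling out future steps, assume the tripwire will be ignored,
    # so current plans aren't steered toward (or away from) enabling it.
    if action == WIREHEAD_ACTION:
        return float("-inf")
    return base_utility(action)

def choose(actions: List[str],
           base_utility: Callable[[str], float],
           is_easy_now: Callable[[str], bool]) -> str:
    return max(actions, key=lambda a: evaluate_now(a, base_utility, is_easy_now))
```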

Half-baked AI Safety ideas thread

If an AI uses a utility function, have that utility function apply to actions, not world-states. 

Note: when evaluating an action, you can still take into account the consequences of the action (e.g. in regards to how this affects whether humans would want you to do the action).

The utility applying to actions not world-states enables things like assigning high utility to the AI shutting itself down.

edit: separated from the idea (which depends on this) that an AI should not care about the future directly

Let's See You Write That Corrigibility Tag

To generalize:

Minimal squishiness. You probably need something like a neural net in order to create a world-model for the AI to use, but could probably do everything else using carefully reviewed human-written code that "plugs in" to concepts in the world-model. (Probably best to have something coded in for what to do if a concept you plugged into disappears/fragments when the AI gets more information).

Abstract goals. The world-model needs enough detail to be able to point to the right concept (e.g. a human-value-related goal), but as long as it does so, the AI doesn't necessarily need to know everything about human values; it will just be uncertain and act under uncertainty (which can include risk-aversion measures, asking humans, etc.).

Present-groundedness. The AI's decision-making procedure should not care about the future directly, only via how humans care about the future. Otherwise it e.g. replaces humans with utility monsters.
