Raymond D

Strongly agree that active inference is underrated both in general and specifically for intuitions about agency.

I think the literature does suffer from ambiguity over where it's descriptive (i.e. an agent will probably approximate a free energy minimiser) vs prescriptive (i.e. the right way to build agents is free energy minimisation, and anything that isn't that isn't an agent). I am also not aware of good work tying active inference to tool use - if you know of any, I'd be pretty curious.

I think the viability thing is maybe slightly fraught - I expect it's mainly for anthropic reasons that we mostly encounter agents that have adapted to preserve their causal boundaries basically independently and reliably, but this is always connected to the type of environment they find themselves in.

For example, active inference points to ways we could accidentally build misaligned optimisers that cause harm - chaining an oracle to an actuator to make a system trying to do homeostasis in some domain (like content recommendation) could, with sufficient optimisation power, create all kinds of weird and harmful distortions. But such a system wouldn't need to have any drive for boundary preservation, or even much situational awareness.

So essentially an agent could conceivably persist for totally different reasons, we just tend not to encounter such agents, and this is exactly the kind of place where AI might change the dynamics a lot.

Interesting! I think one of the biggest things we gloss over in the piece is how perception fits into the picture, and this seems like a pretty relevant point. In general the space of 'things that give situational awareness' seems pretty broad and ripe for analysis.

I also wonder how much efficiency gets lost by decoupling observation and understanding - at least in humans, it seems like we have a kind of hierarchical perception where our subjective experience of 'looking at' something has already gone through a few layers of interpretation, giving us basically no unadulterated visual observation, presumably because this is more efficient (maybe in particular faster?). 

I'd be pretty curious to hear about your disagreements if you're willing to share.

This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)


I believe Dusan was trying to say that davidad's agenda limits the planner AI to only writing provable mathematical solutions. To expand, I believe that compared to what you briefly describe, the idea in davidad's agenda is that you don't try to build a planner that's definitely inner aligned, you simply have a formal verification system that ~guarantees what effects a plan will and won't have if implemented.

Oh interesting! I just had a go at testing it on screenshots from a parallel conversation and it seems like it incorrectly interprets those screenshots as also being of its own conversation. 

So it seems like 'recognising things it has said' is doing very little of the heavy lifting and 'recognising its own name' is responsible for most of the effect.

I'll have a bit more of a play around and probably put a disclaimer at the top of the post some time soon.

The 'reward being chance of winning' stuff changes a bit about how the model generalises if it's playing a game with randomness and is conditioned on the upper end - it biases the model towards 'expecting risk to pay off'. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action. But this isn't super relevant.
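As a sketch of that 1-step game (numbers straight from the example above, purely illustrative):

```python
# One-step game: bank 1 point for sure, or take a 1% shot at 10 points.
ev_bank = 1.0
ev_risk = 0.01 * 10  # = 0.1, so banking is the higher-EV action

# Condition on the outcome "scored 10 points": banking can never produce it,
# so the posterior over actions puts all its mass on the risky, lower-EV one.
p_bank_yields_10 = 0.0
p_risk_yields_10 = 0.01
p_risk_given_10 = p_risk_yields_10 / (p_risk_yields_10 + p_bank_yields_10)
```

So conditioning on the lucky outcome selects for the risky action even though its expected value is ten times lower.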

In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least partly an empirical ML / science of deep learning problem. But my intuition is that no, even if the model 'understands' that an action causes a good outcome [ P(outcome | action) ], it still might not take that action if the action is unlikely on priors [ P(action) ], because the basic driver of its choice is what action it expects to observe given the outcome [ P(action | outcome) ], which is proportional to both of those quantities (and in fact equal to their product, up to a normalising constant P(outcome)).
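A minimal sketch of that Bayes relation (the action names and all numbers are invented for illustration): an action that reliably produces the outcome can still get low posterior probability if it's rare on priors.

```python
def action_posterior(prior, likelihood):
    """P(action | outcome) ∝ P(outcome | action) * P(action).

    prior: {action: P(action)}; likelihood: {action: P(outcome | action)}.
    """
    unnorm = {a: prior[a] * likelihood[a] for a in prior}
    z = sum(unnorm.values())  # the normalising constant P(outcome)
    return {a: p / z for a, p in unnorm.items()}

# A rare-but-effective action loses to a common-but-ineffective one.
post = action_posterior(
    prior={"rare": 0.01, "common": 0.99},
    likelihood={"rare": 0.9, "common": 0.1},
)
```

Here the 'rare' action causes the outcome 90% of the time, but the posterior still favours 'common' (roughly 0.92 vs 0.08), because the prior dominates.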

Re lowering the temperature: the thing the predictive model learns is usually a set of scores (logits) that you normalise into a probability distribution with a softmax function, and you can easily tweak the softmax's temperature to favour likely completions and disfavour unlikely ones. This would greatly decrease the turn-by-turn chance of disconnecting the button (and therefore the overall chance), but probably wouldn't impair the model's performance much elsewhere.
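The temperature tweak is just a division before the softmax; a minimal sketch (the two-action setup and logit values are made up):

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by the temperature before normalising sharpens the
    # distribution when T < 1 and flattens it when T > 1.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0]  # e.g. "leave the button alone" vs "disconnect it"
p_default = softmax(logits, temperature=1.0)
p_sharpened = softmax(logits, temperature=0.5)
# The unlikely action's probability shrinks as the temperature drops,
# which compounds across timesteps into a much smaller overall chance.
```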

Re generalisation - decision transformers don't really have strategies per se; they pick actions moment to moment, and might be systematically miscalibrated about what they'll do in future timesteps. It is true that they'll have some chance at every timestep, which adds up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't otherwise affect performance much.

Re higher conditioning - I think this shouldn't be true. For the sake of argument we can reframe it as a binary outcome, where the model's final return (as a proportion of total possible return) becomes its chance of 'winning'. The thing the model is figuring out is not 'what action leads to me winning', or even 'what action is more likely in worlds where I win than in worlds where I lose', it's 'what action do I expect to see from agents that win'. If on turn 1, 99% of agents in the training set voluntarily slap a button that has a 1% chance of destroying them, and 50% of the survivors go on to win, as do 50% of the agents that didn't slap the button, then a DT will (correctly) learn that 'almost all agents which go on to win slapped the button on turn 1'.
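Running the numbers from that example (a sketch, under the simplifying assumption that survivors and non-slappers win at the same 50% rate):

```python
# 99% of agents slap a button with a 1% chance of destroying them;
# survivors and non-slappers each win 50% of the time.
p_slap = 0.99
p_survive_slap = 0.99
p_win = 0.5  # win rate for slap-survivors and non-slappers alike

p_win_and_slap = p_slap * p_survive_slap * p_win    # ≈ 0.490
p_win_and_noslap = (1 - p_slap) * p_win             # = 0.005
p_slap_given_win = p_win_and_slap / (p_win_and_slap + p_win_and_noslap)
# ≈ 0.99: almost all agents that go on to win slapped the button on turn 1,
# so conditioning on winning barely discourages the self-destructive action.
```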

Re correlation - Sure, I am taking the liberal assumption that there's no correlation in the training data, and indeed a lot of this rests on the training data having a nice structure.

Thanks! Yeah this isn't in the paper, it's just a thing I'm fairly sure of which probably deserves a more thorough treatment elsewhere. In the meantime, some rough intuitions would be:

  • delusions are a result of causal confounders, which must be hidden upstream variables
  • if you actually simulate and thereby specify an entire Markov blanket, it will screen off all other upstream variables, including all possible confounders
  • this is ludicrously difficult for agents with a long history (like a human), but if the STF story is correct, it's sufficient, and crucially, you don't even need to know the full causal structure of reality, just a complete Markov blanket
  • any holes in the Markov blanket/boundary represent ways for unintended causal pathways to leak through, which separate the predictor's predictions about the effect of an action from the actual causal effect of the action, making the agent appear 'delusional'
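A toy numerical illustration of the screening-off point (the causal structure, variable names, and probabilities are all invented for the example): a hidden confounder U drives both the action A and the outcome Y, so A looks predictive of Y until you condition on U.

```python
import random

random.seed(0)

# U is a hidden confounder influencing both action A and outcome Y;
# A has no causal effect on Y at all in this toy model.
samples = []
for _ in range(100_000):
    u = random.random() < 0.5                      # hidden confounder
    a = random.random() < (0.8 if u else 0.2)      # action correlates with U
    y = random.random() < (0.9 if u else 0.1)      # outcome driven only by U
    samples.append((u, a, y))

def p_y(cond):
    hits = [y for (u, a, y) in samples if cond(u, a)]
    return sum(hits) / len(hits)

p_y_given_a = p_y(lambda u, a: a)                   # ≈ 0.74: A "looks" causal
p_y_given_a_u = p_y(lambda u, a: a and u)           # condition on U as well...
p_y_given_nota_u = p_y(lambda u, a: (not a) and u)  # ...and A stops mattering
```

Once the upstream variable is included in the conditioning set (the analogue of specifying the whole blanket), the apparent effect of the action vanishes; a blanket with U left out reproduces the delusion.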

I hope we'll have a proper writeup soon; in the meantime let me know if this doesn't make sense.

A slightly sideways argument for interpretability: It's a really good way to introduce the importance and tractability of alignment research

In my experience it's very easy to explain to someone with no technical background that

  • Image classifiers have got much much better (like in 10 years they went from being impossible to being something you can do on your laptop)
  • We actually don't really understand why they do what they do (like we don't know why the classifier says this is an image of a cat, even if it's right)
  • But, thanks to dedicated research, we have begun to understand a bit of what's going on in the black box (like we know it knows what a curve is, we can tell when it thinks it sees a curve)

Then you say 'this is the same kind of thing that big companies are using to maximise your engagement on social media and sell you stuff, and look at how that's going. And by the way, did you notice how AIs keep getting bigger and stronger?'

At this point my experience is it's very easy for people to understand why alignment matters and also what kind of thing you can actually do about it.

Compare this to trying to explain why people are worried about mesa-optimisers, boxed oracles, or even the ELK problem, and it's a lot less concrete. People seem to approach it much more like a thought experiment and less like an ongoing problem, and it's harder to grasp why 'developing better regularisers' might be a meaningful goal. 

But interpretability gives people a non-technical story for how alignment affects their lives, the scale of the problem, and how progress can be made. IMO no other approach to alignment is anywhere near as good for this.

My main takeaway from this post is that it's important to distinguish between sending signals and trying to send signals, because the latter often leads to goodharting.

It's tricky, though, because obviously you want to be paying attention to what signals you're giving off, and how they differ from the signals you'd like to be giving off, and sometimes you do just have to try to change them. 

For instance, I make more of an effort now than I used to, to notice when I appreciate what people are doing, and tell them, so that they know I care. And I think this has basically been very good. This is very much not me dropping all effort to signal.

But I think what you're talking about is very applicable here, because if I were just trying to maximise that signal, I would probably just make up compliments, and this would probably be obviously insincere. So I guess the big question is, which things do you stop trying to do?

(Also, I notice I'm now overthinking editing this comment because I've switched gears from 'what am I trying to say' to 'what will people interpret from this'. Time to submit, I guess.)
