Oliver Sourbut

Call me Oliver or Oly - I don't mind which.

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! I'm always interested in discussions about axiology, x-risks, and s-risks.

I'm currently (2022) just embarking on a PhD in AI in Oxford, and also spend time in (or in easy reach of) London. Until recently I was working as a senior data scientist and software engineer, and I've been doing occasional AI alignment research with SERI.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently:

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Simsion - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foodie theme and frantic real-time coordination of playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.


Breaking Down Goal-Directed Behaviour



This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!

Two thoughts:

First, I sense that you're somewhat dissatisfied with using total variation distance ('average action probability change') as a quantitative measure of the impact of an intervention on behaviour. In particular, it doesn't weight 'meaningfulness', and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order to test at scale, visualisation becomes a bottleneck, so you need something quantitative like this. Perhaps you might get some mileage by considering the stationary distribution of the policy-induced Markov chain? It can be approximated by multiplying the transition matrix by itself a few times! Obviously that matrix is technically quadratic in the state count, but it's also very sparse :) so that might be relatively tractable given that you've already computed an NN forward pass for each state to get to this point. Or you could eigendecompose the transition matrix.
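The repeated-squaring idea can be sketched like this (a minimal illustration with a hypothetical 4-state policy-induced chain; I use dense NumPy for clarity, but for real state counts you'd want `scipy.sparse` matrices):

```python
import numpy as np

# Toy policy-induced Markov chain: P[s, s'] is the probability of moving
# from state s to s' under the fixed policy. (Hypothetical numbers.)
P = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.5, 0.0, 0.0, 0.5],
])

# Approximate the stationary distribution by repeated squaring:
# after k squarings M = P^(2^k), and every row converges to the
# stationary distribution for an ergodic chain.
M = P.copy()
for _ in range(10):  # M = P^1024
    M = M @ M

stationary = M[0]  # any row works once the chain has mixed
assert np.allclose(M[0], M[1], atol=1e-6)  # rows agree at convergence
print(stationary)  # weights states by long-run visit frequency
```

(The eigendecomposition route finds the same thing: the stationary distribution is the left eigenvector of the transition matrix with eigenvalue 1.)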

Second, this seems well-informed to me, but I can't really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that'll be clearer in a later post.

Oh boy, this is terrifyingly familiar from my oncall days!

We call a subspace $U$ invariant under $T$ if, for all $u \in U$, $Tu \in U$,

should read

Let $U$ now specifically be a one-dimensional subspace of $V$ such that, for all $v \in V$, $U = \{\lambda v : \lambda \in \mathbb{F}\}$,

I think such a $U$ cannot exist in most cases, and it should instead read '... for some $v \in V$ ...'

The expression for $U$ is describing the span of the vector $v$, so certainly if $V$ is more than one-dimensional, if some subspace $U$ has this property for all $v \in V$ then it has this property for two linearly independent vectors in $V$, which is a contradiction.
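Spelling out the contradiction (on my reading, where the one-dimensional subspace is $U = \{\lambda v : \lambda \in \mathbb{F}\} = \operatorname{span}(v)$ — an assumption about the intended notation):

```latex
% Suppose U = span(v) for *all* v in V, with dim V > 1.
% Pick linearly independent v_1, v_2 in V. Then
U = \operatorname{span}(v_1) = \operatorname{span}(v_2)
\;\Longrightarrow\; v_1, v_2 \in U
\;\Longrightarrow\; \dim U \ge 2,
```

which contradicts $U$ being one-dimensional, so the quantifier can only be 'for some $v$'.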

This is great! I'll thread a few nits under this comment.

FWIW my experience of MATS 0.1 (i.e. the first run/pilot 2021-22) was that it was more open-ended and diversity-focused than subsequent MATS, which has been more apprenticeship-focused. That was helpful for me at the time, but I don't know if it was ever the intention per se, and I agree that the focus of MATS now is different. I haven't thought long enough to decide if this is good or bad.

This approach also makes lots of regularisation techniques transparent. Typically regularisation corresponds to applying some prior (over the weights/parameters of the model you're fitting). e.g. L2 norm aka ridge aka weight decay regularisation corresponds exactly to taking a Gaussian prior on the weights and finding the maximum a posteriori (MAP) estimate (rather than the maximum likelihood).
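As a concrete check (a sketch with synthetic data; `alpha` and `sigma2` are arbitrary illustrative values), the closed-form ridge solution coincides with a numerically-found MAP estimate under the corresponding Gaussian prior:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha, sigma2 = 2.0, 0.1 ** 2

# Closed-form ridge solution: argmin ||y - Xw||^2 + alpha * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Negative log-posterior under Gaussian likelihood y ~ N(Xw, sigma2)
# and Gaussian prior w ~ N(0, sigma2 / alpha), dropping constants:
def neg_log_posterior(w):
    return (np.sum((y - X @ w) ** 2) + alpha * np.sum(w ** 2)) / (2 * sigma2)

w_map = minimize(neg_log_posterior, np.zeros(3)).x
print(np.max(np.abs(w_ridge - w_map)))  # the two estimates coincide
```

Scaling the penalty by the noise variance is what makes the correspondence exact: the prior variance is `sigma2 / alpha`, so stronger regularisation means a tighter prior on the weights.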

(and 'self-replicating' for some reasonable operationalisation)

In short, I think ADS is available as a mechanism to the extent that the responses of a system can affect subsequent inputs to the system (technically this is always the case, but in practice the degree of effect varies enormously). This need not be a system subject to further training updates, though if it is, then depending on how those updates are generated, ADS behaviour may or may not be reinforced.
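A toy illustration of that mechanism (entirely hypothetical dynamics): a fixed, non-learning 'recommender' whose outputs still shift the distribution of its own future inputs:

```python
# A fixed policy: recommend content slightly more extreme than the
# user's current (hypothetical, scalar) preference state in [0, 1].
def recommend(preference):
    return min(1.0, preference + 0.05)

preference = 0.2
history = [preference]
for _ in range(50):
    shown = recommend(preference)
    # The user's state drifts toward what they are shown, shifting the
    # input distribution the system will see next - ADS with no
    # training updates to the system at all.
    preference += 0.5 * (shown - preference)
    history.append(preference)

print(history[0], history[-1])  # the input distribution has drifted upward
```

If this system *were* under online training, that drift would also change its gradients, which is where the two concepts start to interact.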

Gradient hacking was originally coined to mean deliberate, situationally aware influence over training updates. (ADS is one mechanism by which this could be achieved.)

The term 'gradient hacking' also seems to be commonly used to refer to any kind of system influence over training updates, whether situationally aware/deliberate or not. I think it's helpful to distinguish these, so I often say 'deliberate gradient hacking' to be explicit.

Yeah, I read the ADS paper(s) after writing this post. I think it's a useful framing: more 'selection-theorem'-ey, with less emphasis on deliberateness/purposefulness.

Additionally, I think there is another conceptual distinction worth attending to:

  • auto-induced distributional shift is about affecting environment to change inputs
    • the system itself might remain unchanging and undergo no further learning, and still qualify
  • gradient hacking is about changing environment/inputs/observations to change updates (gradients)
    • the system is presumed subject to updates, which it is taking (some amount of deliberate) influence over

In this post I wrote

I'm looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.

which I think rules out (hopefully!) contemporary recommender systems on the above two distinctions (as you gestured to regarding mesa-optimization).

In practice, for a system subject to online outer training, ADS changes the inputs which changes the training distribution, in fact causing some change in the updates to the system (perhaps even a large change!). But ADS per se doesn't imply these effects are deliberate, though again you might be able to say something selection-theorem-ey about this process if iterated. Indeed, a competent and deliberate gradient hacker might use means of ADS quite effectively.

None of this is to say that ADS is not a concern; I just think it's conceptually somewhat distinct!
