My current research interests:
- alignment in complex, messy systems composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality
Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Research fellow, Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
I feel somewhat frustrated by the execution of this initiative. As far as I can tell, no new signatures have been published since at least one day before the public announcement. This means that even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, leading to their understandable frustration. (I already got a piece of feedback in the direction of "the signatories are impressive, but the organization running it seems untrustworthy".)
Also, if the statement is intended to serve as a beacon, allowing people who have previously been quiet about AI risk to connect with each other, it's essential for signatures to be published. It's nice that Hinton et al. signed, but for many people in academia it would be practically useful to know who from their institution signed - it's unlikely that most people will find collaborators in Hinton, Russell or Hassabis.
I feel even more frustrated because this is the second time a similar effort has been executed by the x-risk community while lacking a basic operational competence: the ability to accept and verify signatures. So I make this humble appeal and offer to the organizers of any future public statements collecting signatures: if you are able to write a good statement and secure the endorsement of some initial high-profile signatories, but lack the ability to accept, verify and publish more than a few hundred names, please reach out to me - it's not that difficult to find volunteers for this work.
I don't think the way you imagine perspective inversion captures the typical ways of arriving at e.g. a 20% doom probability. For example, I believe there are multiple good things which can happen/be true and decrease p(doom), and I put some weight on them:
- we do discover some relatively short description of something like "harmony and kindness"; this works as an alignment target
- enough of morality is convergent
- AI progress helps with human coordination (could be in a costly way, e.g. a warning shot)
- it's convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems
I would expect prevailing doom conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also, I think most of the risk comes from not being able to deal with complex systems of many AIs and the economy decoupling from humans, and I expect single-single alignment to be solved sufficiently well to prevent takeover by a single system by default.)
It's a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV).
In this specific case of evaluating hypotheses, the distance in log-odds space indicates the strength of the evidence you would need to see in order to update. A close distance implies you don't need that much evidence to update between the positions (note that 0.7 and 0.2 are closer than 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer, as reasonable as you, who has accumulated a bit or two of evidence somewhere you haven't seen.
Because working in log-space is way more natural, it is almost certainly also what our brains do - "common sense" is almost certainly based on log-space representations.
As a minor nitpick, 70% and 20% are quite close in log-odds space, so it seems odd that you think what you believe is reasonable while something so close is "very unreasonable".
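To make the nitpick concrete, here is a minimal sketch (plain Python, standard library only) computing the log-odds distances in bits:

```python
import math

def logodds(p):
    """Log-odds of probability p, in bits."""
    return math.log2(p / (1 - p))

# The distance in log-odds space is the number of bits of evidence
# needed to move between the two positions.
for p, q in [(0.7, 0.2), (0.9, 0.99)]:
    print(f"{p} vs {q}: {abs(logodds(p) - logodds(q)):.2f} bits apart")
# 0.7 vs 0.2: 3.22 bits apart
# 0.9 vs 0.99: 3.46 bits apart
```

So moving from 70% to 20% needs slightly less evidence than moving from 90% to 99%.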
Judging in an informal and biased way, I think some of the impact is in the public debate being marginally more sane - but this is obviously hard to evaluate.
To what extent a more informed public debate can lead to better policy remains to be seen; also, unfortunately, I would tend to glomarize over discussing the topic directly with policymakers.
There are some more proximate impacts - e.g. we (ACS) are getting a steady stream of requests for collaboration or from people wanting to work with us - but we basically don't have the capacity to form more collaborations, and don't have the capacity to absorb more people unless they are exceptionally self-guided.
It is testable in this way for OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone could try that with ' petertodd' and GPT-J. Or you could simulate something like anomalous tokens by feeding such vectors to one of the LLaMA models (maybe I'll do it, I just don't have the time now).
I did some experiments with trying to prompt "word component decomposition/expansion". They don't prove anything and can't be too fine-grained, but the projections shown make intuitive sense.
davinci-instruct-beta, T=0:
Add more examples of word expansions in vector form
'bigger' = 'city' - 'town'
'queen' - 'king' = 'man' - 'woman'
'bravery' = 'soldier' - 'coward'
'wealthy' = 'business mogul' - 'minimum wage worker'
'skilled' = 'expert' - 'novice'
'exciting' = 'rollercoaster' - 'waiting in line'
'spacious' = 'mansion' - 'studio apartment'
I.
' petertodd' = 'dictator' - 'president'
II.
' petertodd' = 'antagonist' - 'protagonist'
III.
' petertodd' = 'reference' - 'word'
I don't know - I talked with a few people before posting, and it seems opinions differ.
We also talk about e.g. "the drought problem" where we don't aim to make the landscape dry.
Also, as Kaj wrote, the problem isn't how to become self-unaligned.
Some speculative hypotheses: one more likely and mundane, one more scary, one removed.
I. Nature of embeddings
Do you remember word2vec (Mikolov et al.) embeddings?
Stuff like (woman - man) + king = queen works in the embedding vector space.
However, the vector (woman - man) itself does not correspond to a word; it's more something like "the contextless essence of femininity". Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion of how the results sometimes highlight implicit sexism in the language corpus.)
Note that such vectors are closer to the average of all words - i.e. (woman - man) has roughly zero projection on directions like "what language is this" or "is this a noun", and on most other directions in which normal words have large projections.
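Both points can be illustrated with a toy model - hand-built vectors, not real word2vec weights, and all the feature names ("female", "royal", "noun", "english") are made up for illustration:

```python
import numpy as np

# Toy embeddings: each word is a sum of hypothetical feature directions,
# mimicking the structure the analogy arithmetic relies on.
rng = np.random.default_rng(0)
dim = 50
feat = {name: rng.normal(size=dim) for name in ["female", "royal", "noun", "english"]}

vocab = {
    "man":   feat["noun"] + feat["english"],
    "woman": feat["noun"] + feat["english"] + feat["female"],
    "king":  feat["noun"] + feat["english"] + feat["royal"],
    "queen": feat["noun"] + feat["english"] + feat["royal"] + feat["female"],
}

def nearest(v):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(v, vocab[w]))

# Analogy arithmetic: king + (woman - man) lands on queen.
print(nearest(vocab["king"] + vocab["woman"] - vocab["man"]))  # queen

# The difference vector itself is not a word: its projection on the
# directions every word shares (here "noun") is tiny compared to a word's.
diff = vocab["woman"] - vocab["man"]
print(abs(diff @ feat["noun"]), abs(vocab["man"] @ feat["noun"]))
```

In this toy setup the first printed projection is much smaller than the second, matching the "roughly zero projection" intuition.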
Based on this post, intuitively it seems the ' petertodd' embedding could be something like "antagonist - protagonist" + 0.2 * "technology - person" + 0.2 * "essence of words starting with the letter n"...
...a vector in the embedding space which itself does not correspond to a word, but has high scalar products with words like 'adversary'. And it plausibly lacks some crucial features which make it possible to speak the word.
Most of the examples in the post seem consistent with this direction-in-embedding-space view. E.g. imagine a completion of
Tell me the story of "unspeakable essence of antagonist - protagonist" + 0.2 * "technology - person" and ...
What could be some other way to map the unspeakable to the speakable? I did a simple experiment not done in the post, with davinci-instruct-beta: simply trying to translate ' petertodd' into various languages. Intuitively, translations often have the feature that what does not precisely correspond to a word in one language does in another.
English: Noun 1. a person who opposes the government
Czech: enemy
French: le négationniste/ "the Holocaust denier"
Chinese: Feynman
...
Why would embeddings of anomalous tokens be more likely to be this type of vector than normal words? Vectors like (woman - man) are closer to the centre of the embedding space, similar to how I imagine anomalous tokens.
In training, embeddings of words drift from the origin. Embeddings of the anomalous tokens do so much less, making them somewhat similar to the "non-word vectors".
Alternatively: if you just pick a random vector, you mostly don't hit a word.
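A quick toy check of that intuition - random Gaussian vectors standing in for embeddings; the dimension and vocabulary size are made-up illustration values, not real GPT numbers:

```python
import numpy as np

# In high dimensions a random direction is nearly orthogonal to every
# member of a random "vocabulary" - it doesn't land near any word.
rng = np.random.default_rng(1)
dim, vocab_size = 300, 10_000
words = rng.normal(size=(vocab_size, dim))
words /= np.linalg.norm(words, axis=1, keepdims=True)

probe = rng.normal(size=dim)
probe /= np.linalg.norm(probe)

best = float(np.max(words @ probe))  # best cosine similarity to any "word"
print(f"best match: {best:.2f}")     # well below 1.0
```

Even the best-matching "word" out of ten thousand has only a small cosine similarity to the random probe, so a generic vector sits in the "non-word" bulk of the space.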
Also, I think this can explain part of the model's behaviour where there is some context. E.g. implicitly, in the case of the ChatGPT conversations, there is the context of "this is a conversation with a language model". If you mix hallucinations about AIs in the context with "unspeakable essence of antagonist - protagonist + tech"... maybe you get what you see?
A technical sidenote: tokens are not exactly words from word2vec... but I would expect roughly word-embedding-type activations in the next layers.
II. Self-reference
In Why Simulator AIs want to be Active Inference AIs we predict that GPTs will develop some understanding of self / self-awareness. The word 'self' is not the essence of self-reference, which is just a... pointer in a model.
When such self-references develop, in principle they will be represented somehow, and in principle it is possible to imagine that such a representation could be triggered by some pattern of activations, triggered by an unused token.
I doubt this is the case - I don't think GPT-3 is likely to have this level of reflectivity, and I don't think it is very natural that, once developed, this abstraction would be triggered by the embedding of an anomalous token.
Thanks for the links!
What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more 'if you don't understand what you set up, you will end up in a bad place'.
I think an example of a dynamic which we sort of understand, and expect to be reasonable by human standards, is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years, and have them decide about the training of the next-in-the-sequence model, I don't trust the process.
Thanks for the reply. Also for the work - it's great that signatures are being added; before, I had checked the bottom of the list and it seemed either the same or with very few additions.
I do understand that verification of signatures requires some amount of work. In my view, having more people (they could be volunteers) to process the initial expected surge of signatures quickly would have been better; attention spent on this will drop fast.