So the idea is to use "Artificial Intention" to specifically speak of the subset of concerns about what outcomes an artificial system will try to steer for, rather than the concerns about the world-states that will result in practice from the interaction of that artificial system's steering plus the steering of everything else in the world?
Makes sense. I expect it's valuable to also have a term for the bit where you can end up in a situation that nobody was steering for due to the interaction of multiple systems, but explicitly separating those concerns is probably a good idea.
I think the issues you point out with the "alignment" name are real issues. That said, the word "intent" comes with its own issues.
Intention doesn't have to be conscious or communicable. It is just a preference for some futures over others, inferred as an explanation for behavior that chooses some futures over others. Like, even single-celled organisms have basic intentions if they move towards nutrients or away from bad temperatures.
I don't think "intention" is necessarily the best word for this unless you go full POSIWID. A goose does not "intend" to drag all vaguely egg-shaped objects to her nest and sit on them, in the sense that I don't think geese prefer sitting on a clutch of eggs over a clutch of eggs, a lightbulb, and a wooden block. And yet that is the expressed behavior anyway, because lightbulbs were rare and eggs that rolled out of the nest common in the ancestral environment.
I think "artificial system fitness-for-purpose" might come closer to gesturing at what "AI alignment" is pointing at (including being explicit that it's a 2-place term), but at the cost of being extremely not catchy.
and freezing bank accounts of people whose only crime was donating money to the protesters
Slightly off topic, but is this a thing that was actually verified to have happened? The only case I had heard of was the "Brianne from Chilliwack" one that seemed not to pan out as real as far as I can tell.
(Asking because at the time there was quite a bit of discussion about whether the overreach was "trying to punish protestors directly" or "deliberately trying to create a chilling effect on any support for protests")
There's no way they could interoperate without massive computing and building a new model.
It historically has been shown that one can interpolate between a vision model and a language model[1]. And, more recently, it has been shown that yes, you can use a fancy transformer to map between intermediate representations in your image and text models, but you don't have to do that and in fact it works fine[2] to just use your frozen image encoder, then a linear mapping (!), then your text decoder.
I personally expect a similar phenomenon if you use the first half of an English-only pretrained language model and the second half of a Japanese-only pretrained language model -- you might not literally be able to use a linear mapping as above, but I expect you could use a quite cheap mapping. That said, I am not aware of anyone who has actually attempted the thing so I could be wrong that the result from [2] will generalize that far.
(Aside: was that a typo, or did you intend to say "compute" instead of "computing power"?)
Yeah, I did mean "computing power" there. I think it's just a weird way that people in my industry use words.[3]
Example: DeepMind's Flamingo, which demonstrated that it was possible at all to take a pretrained language model and a pretrained vision model and glue them together into a multimodal model, and that doing so produced SOTA results on a number of benchmarks. See also this paper, also out of DeepMind.
For example, see this HN discussion about it. See also the "compute" section of this post, which talks about things that are "compute-bound" rather than "bounded on the amount of available computing power".
Why waste time use lot word when few word do trick?
Your colab's "Check it can speak French" section seems to be a stub.
Fixed.
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet.
Updated the colab to try out this approach with a range of coefficients.
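To make the "coefficient 56" point concrete, here is a toy sketch in plain Python. The 3-dimensional `direction` vector is made up, and I'm assuming the worst case where all 56 additions point the same way (real additions won't be perfectly aligned, so the actual blow-up would be somewhat smaller). It just shows that summing n coefficient-1 additions behaves like one addition with coefficient n, and that scaling each coefficient by 1/n undoes that.

```python
# Toy illustration only: the vectors are hypothetical, not real model activations.

def combined_steering(vectors, coeff=1.0):
    """Sum `coeff * v` over all steering vectors, elementwise."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        total = [t + coeff * x for t, x in zip(total, v)]
    return total

def l2_norm(v):
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5

n_additions = 56
direction = [0.5, -1.0, 2.0]          # hypothetical steering direction
vectors = [direction] * n_additions   # worst case: every addition is aligned

unscaled = combined_steering(vectors, coeff=1.0)
rescaled = combined_steering(vectors, coeff=1.0 / n_additions)

# How much larger is the combined vector than a single addition?
ratio_unscaled = l2_norm(unscaled) / l2_norm(direction)
ratio_rescaled = l2_norm(rescaled) / l2_norm(direction)

print(f"unscaled: {ratio_unscaled:.1f}x, rescaled: {ratio_rescaled:.1f}x")
```

Sweeping `coeff` over a range, as in the updated colab, is the empirical version of picking that scale.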
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8 bit or even 4 bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at "take the difference between two things that seem like they might plausibly have a similar relationship".
Update: I have gotten GPT-J-6B up and running on Colab (link, it's a new one), and working alright with TransformerLens and montemac's algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I'm fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.
Let's say we have a language model that only knows how to speak English and a second one that only knows how to speak Japanese. Is your expectation that there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the "glue" takes <1% of the compute used to train the independent models?
I weakly expect the opposite, largely based on stuff like this, and based on playing around with using algebraic value editing to get an LLM to output French in response to English (but also note that the LLM I did that with knew English and the general shape of what French looks like, so there's no guarantee that result scales or would transfer the way I'm imagining).
I think we also care about how fast it gets arbitrarily capable. Consider a system which finds an approach that can measure approximate actions-in-the-world-Elo (where an entity with a 200-point advantage in actions-in-the-world-Elo will choose the better action 76% of the time), but which improves using a "mutate and test" method over an exponentially large space, such that each successive 100-point gain takes 5x as long to find as the previous one. Suppose it starts out with an actions-in-the-world-Elo 1000 points lower than an average human and a 1-week time-to-next-improvement. That hypothetical system is technically a recursively self-improving intelligence that will eventually reach any point of capability, but it's not really one we need to worry that much about unless it finds techniques to dramatically reduce the search space.
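To put rough numbers on that hypothetical, here is a quick back-of-the-envelope sketch. It assumes the standard Elo expected-score formula, and it reads the setup as "the first 100-point gain takes 1 week and each later one takes 5x as long as the one before"; the function names are just mine.

```python
def elo_win_probability(advantage):
    """Standard Elo expected score for a side with the given rating advantage."""
    return 1.0 / (1.0 + 10.0 ** (-advantage / 400.0))

def weeks_to_close_gap(gap_points, step_points=100, first_step_weeks=1.0, slowdown=5.0):
    """Total search time if each successive step takes `slowdown` times longer."""
    steps = gap_points // step_points
    return sum(first_step_weeks * slowdown ** k for k in range(steps))

p = elo_win_probability(200)            # chance of picking the better action
total_weeks = weeks_to_close_gap(1000)  # ten 100-point gains to reach human level
total_years = total_weeks / 52.0        # assuming 52 weeks per year

print(f"win probability at +200: {p:.2f}")
print(f"time to close 1000 points: {total_weeks:.0f} weeks (~{total_years:.0f} years)")
```

Under those assumptions the ten gains take 1 + 5 + 25 + ... + 5^9 weeks, which is tens of thousands of years just to reach average-human level.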
Like I suspect that GPT-4 is not actually very far from the ability to come up with a fine-tuning strategy for any task you care to give it, and to create a simple directory of fine-tuned models, and to create a prompt which describes to it how to use that directory of fine-tuned models. But fine-tuning seems to take an exponential increase in data for each linear increase in performance, so that's still not a terribly threatening "AGI".
I just tried that, and it kinda worked. Specifically, it worked to get gpt2-small to output text that structurally looks like French, but not to coherently speak French.
Although I then just tried feeding the base gpt2-small a passage in French, and its completions there were also incoherent, so I think it's just that that version hasn't seen enough French to speak it very well.
I found an even dumber approach that works. The approach is as follows:
[...] n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like [...]
Example output: for the prompt
He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.
here are some of the outputs the patched model generates
...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.
Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like
Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour
That response translates as approximately
<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>
So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.
You can look at what I did at this colab. It is a very short colab.
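For anyone who doesn't want to open the colab, the "first i fragments in English and the rest in French" step can be sketched roughly like this. This is not the colab's actual code: the function name and the pre-split fragment lists are hypothetical, and I'm assuming n counts the boundaries between fragments (so i = n leaves only the last fragment in French).

```python
def mixed_sentences(english_fragments, french_fragments):
    """Build sentences that switch from English to French after i fragments.

    Assumption: both lists are aligned phrase-by-phrase, and n is the number
    of boundaries between fragments, so i runs from 0 to n inclusive.
    """
    if len(english_fragments) != len(french_fragments):
        raise ValueError("English and French fragment counts must match")
    n = len(english_fragments) - 1
    return [
        " ".join(english_fragments[:i] + french_fragments[i:])
        for i in range(n + 1)
    ]

# Hand-split fragments, for illustration only.
english = ["He became mayor", "in 1957", "after the death of Albert Cobo"]
french = ["Il est devenu maire", "en 1957", "après la mort d'Albert Cobo"]

results = mixed_sentences(english, french)
for sentence in results:
    print(sentence)
```

Each of these mixed sentences is the kind of text that would then be turned into an activation addition.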
Making it harder to legally use models accomplishes two things:
Consider the situation with opiates in the US: our attempts to erect legal barriers to people obtaining opiates have indeed reduced the number of people legally obtaining opiates, and probably even reduced total opiate consumption, but at the cost that a lot of people were driven to buy their opiates illegally instead of going through medical channels.
I don't expect computing power sufficient to train powerful models to be easier to control than opiates, in worlds where doom looks like rapid algorithmic advancements that decrease the resource requirements to train and run powerful models by orders of magnitude.